I am currently generating quads in my vertex shader by keeping the vertex data in an SRV and expanding each quad from SV_VertexID, i.e. the draw call is issued with null index and vertex buffers bound and no input layout.
This cuts the bandwidth spent on index and vertex data to a quarter, because a single index fetches one quad record and that record is reused for all four vertices of the quad (see the shader below).
The problem is that the VS is currently invoked 6 times per quad because I am using tri lists. I can reduce this to four invocations in one of two ways. I could go back to using an index buffer (and so incur 4x the memory bandwidth for the indices), or I could move from tri lists to quad lists, in the hope that the VS is only executed four times because the hardware splits each quad into two triangles under the hood.
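To make the counts above concrete, here is a minimal CPU-side C++ sketch that mirrors the shader's vertex ID arithmetic. It is only a model, not GPU code. The names `kQuadCorners`, `quadIndex`, `cornerIndex`, `vsInvocationsTriList` and `vsInvocationsFourPerQuad` are hypothetical, and the four-per-quad figure is the hoped-for best case rather than measured hardware behaviour.

```cpp
#include <cstdint>

// Same corner table as g_quad in the shader: two triangles per quad.
constexpr std::uint32_t kQuadCorners[6] = {0, 1, 2, 2, 1, 3};

// Which quad a non-indexed vertex ID belongs to (vertexId / 6 in the shader).
constexpr std::uint32_t quadIndex(std::uint32_t vertexId)
{
    return vertexId / 6;
}

// Which of the quad's four corners this vertex ID expands to.
constexpr std::uint32_t cornerIndex(std::uint32_t vertexId)
{
    return kQuadCorners[vertexId % 6];
}

// Non-indexed tri list: every vertex ID is unique, so the VS runs once per
// vertex ID, i.e. 6 times per quad.
constexpr std::uint64_t vsInvocationsTriList(std::uint64_t quadCount)
{
    return quadCount * 6;
}

// Assumed best case: each of the four corners is shaded exactly once,
// whether through index reuse or a quad primitive. This is an assumption,
// not a guarantee about any particular GPU.
constexpr std::uint64_t vsInvocationsFourPerQuad(std::uint64_t quadCount)
{
    return quadCount * 4;
}

// Vertex IDs 6..11 all belong to quad 1 and repeat corners 1 and 2.
static_assert(quadIndex(6) == 1 && quadIndex(11) == 1, "second quad");
static_assert(cornerIndex(8) == 2 && cornerIndex(9) == 2, "corner 2 is shaded twice");
```

For 1000 quads this model gives 6000 VS invocations with the current tri list against 4000 in the best case, so a third of the current vertex work is redundant.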
On current hardware, is there any performance overhead to using a quad list instead of a tri list? How efficient is the hardware at breaking quads down into triangles?
Does the quad list primitive make use of the post-transform cache, or will it effectively do the same as the logic below and invoke the VS 6 times?
I'm asking this question before I go on to write performance tests in case anybody knows the definitive answer.
(Ignore that I'm loading a full 32-bit index, position and size at the moment. I will be reducing this!)
static const int g_quad[6] =
{
    0, 1, 2, 2, 1, 3
};

Buffer<uint> g_indices : register(vs, t0);
ByteAddressBuffer g_quads : register(vs, t1);

psDepth VS_Quad(uint vertexId : SV_VERTEXID)
{
    uint quadIndex = vertexId / 6;
    uint vertexIndex = g_quad[vertexId % 6];
    uint quadAddress = g_indices[quadIndex] * PRIMITIVE_SIZE;
    uint4 vertexdata0 = g_quads.Load4(quadAddress);
    uint4 vertexdata1 = g_quads.Load4(quadAddress + 16); // I have other data stuffed in yzw not shown here
    uint3 mask = uint3(vertexIndex & 1, (vertexIndex & 2) >> 1, 0);
    uint3 invmask = !mask;
    float3 position = asfloat(vertexdata0.xyz); // I can pack vertex data into smaller values than 32bit and will do!
    float3 size = float3(asfloat(vertexdata0.w), asfloat(vertexdata1.x), 0.0f);
    float4 screen_pos = float4((position * invmask) + ((position + size) * mask), 1.0f);