
Performance: Fastest quad drawing

Started by May 07, 2019 12:20 PM
6 comments, last by BillyTheFisherman 5 years, 8 months ago

I am currently generating quads in my vertex shader by keeping the vertex data in an SRV and expanding it using SV_VertexID, i.e. null index and vertex buffers are bound for the draw call and there is no input layout.

This cuts the bandwidth spent on index and vertex data to a quarter, because a single index points at the quad's data and that data is reused for all four vertices of the quad (see below).

The problem I have is that I am currently invoking the VS 6 times for each quad because I am using tri lists. I can reduce the VS invocations to four either by going back to an index buffer (and so incurring 4x the memory bandwidth for the indices) OR by moving to quad lists (instead of tri lists), in the hope that the VS will only be executed four times because the hardware automatically breaks the quad into two triangles under the hood.
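
To make the 6-versus-4 count concrete, here is a minimal CPU-side sketch in C++ that mirrors the vertex-ID decode used in the shader further down. The names `decodeVertexId` and `uniqueCorners` are hypothetical and exist only for this illustration; the sketch counts distinct corners and says nothing about how real hardware caches vertices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>

// Corner order for the two triangles of a quad, same as the shader's g_quad table.
static const std::uint32_t kQuadCorners[6] = {0, 1, 2, 2, 1, 3};

struct DecodedVertex {
    std::uint32_t quadIndex;  // which quad this invocation belongs to
    std::uint32_t corner;     // corner within the quad, 0..3
    std::uint32_t maskX;      // 1 if the corner is offset along x by the quad size
    std::uint32_t maskY;      // 1 if the corner is offset along y by the quad size
};

// Mirrors the shader's decode for a non-indexed triangle list:
// six consecutive vertex IDs per quad.
inline DecodedVertex decodeVertexId(std::uint32_t vertexId) {
    const std::uint32_t corner = kQuadCorners[vertexId % 6];
    assert(corner < 4);  // the table only holds corners 0..3
    return {vertexId / 6, corner, corner & 1u, (corner >> 1) & 1u};
}

// Counts the distinct (quad, corner) pairs touched by `vertexCount`
// non-indexed invocations, i.e. the work a 4-per-quad path would do.
inline std::size_t uniqueCorners(std::uint32_t vertexCount) {
    std::set<std::pair<std::uint32_t, std::uint32_t>> seen;
    for (std::uint32_t v = 0; v < vertexCount; ++v) {
        const DecodedVertex d = decodeVertexId(v);
        seen.insert(std::make_pair(d.quadIndex, d.corner));
    }
    return seen.size();
}
```

For ten quads the non-indexed path issues 60 invocations while only 40 distinct corners exist; corners 1 and 2 of every quad are computed twice.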

Nowadays is there any performance overhead between using a quad list vs a tri list?  How efficient is the hardware at breaking down quads into triangles?

Does using the quad list primitive over the tri list primitive make use of the post-transform cache, or will it effectively do the logic below and invoke the VS 6 times?

I'm asking this question before I go on to write performance tests in case anybody knows the definitive answer.

(Ignore that I'm loading full 32bit index, position and size atm - I will be reducing this!)


static const int g_quad[6] =
{
    0, 1, 2, 2, 1, 3
};

Buffer<uint>             g_indices        : register(vs, t0);
ByteAddressBuffer        g_quads          : register(vs, t1);

psDepth VS_Quad(uint vertexId : SV_VERTEXID)
{
    uint quadIndex      = vertexId / 6;
    uint vertexIndex    = g_quad[vertexId % 6];

    uint quadAddress    = g_indices[quadIndex] * PRIMITIVE_SIZE;

    uint4 vertexdata0   = g_quads.Load4(quadAddress);
    uint4 vertexdata1   = g_quads.Load4(quadAddress + 16);  // I have other data stuffed in yzw not shown here

    uint3 mask          = uint3(vertexIndex & 1, (vertexIndex & 2) >> 1, 0);
    uint3 invmask       = !mask;

    float3 position     = asfloat(vertexdata0.xyz);         // I can pack vertex data into smaller values than 32bit and will do!
    float3 size         = float3(asfloat(vertexdata0.w), asfloat(vertexdata1.x), 0.0f);

    float4 screen_pos   = float4((position * invmask) + ((position + size) * mask), 1.0f);

 

How many triangles are we talking about here and why is it a performance concern?


Thanks for the reply, Phongy; it is very much appreciated.

Let's assume this is a number of triangles you would deem a lot, and that it's a use case where performance is a concern. For example, this type of scenario can apply to particle systems, decals (boxes instead of quads), occlusion culling (again boxes) and any number of other applications: UIs, sprite systems, etc.

I mean, without a use case it's really hard to answer.

The general idea, though, is to use triangle lists/strips and make as few "Draw Primitive" calls as possible. Graphics cards are really good at munching through high numbers of triangles, so it's more about how you deliver them.

 

The first thing I'd say is that unless you've extensively profiled this scenario you're unlikely to know whether reducing the VS invocation count from 6 to 4 is going to make any difference to your performance. If it's not the bottleneck then chances are it'll make little-to-no difference. It sounds like you made a choice to optimise for reducing bandwidth before knowing that was a problem and now you have another potential problem as a consequence of that premature optimisation.

Without resorting to instanced draws, you could easily precompute a 16-bit index buffer that contains enough indices to draw 16,384 quads, and it only adds an average of 12 bytes of extra bandwidth per quad (3 bytes per vertex invocation). This is not a significant amount of data by today's standards. A modern GPU has in the region of 200-600GB/s of bandwidth available but only peak vertex throughputs of around 2-8B vertices per second, so you're adding a pretty small amount of extra bandwidth in order to reduce your vertex invocations by 33% (6 -> 4)!
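
As a rough illustration of the precomputed index buffer described above, the following C++ sketch fills a 16-bit triangle-list index buffer for up to 16,384 quads. The function names `buildQuadIndices` and `indexBytesPerQuad` are hypothetical, and the layout (quad q owns vertices 4q to 4q+3) is an assumption made for this example rather than something the post specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds a triangle-list index buffer for `quadCount` quads, assuming
// quad q owns the four consecutive vertices 4q .. 4q+3.
std::vector<std::uint16_t> buildQuadIndices(std::size_t quadCount) {
    // 16-bit indices address vertices 0..65535, which is 16,384 quads at most.
    assert(quadCount <= 16384);
    static const std::uint16_t pattern[6] = {0, 1, 2, 2, 1, 3};

    std::vector<std::uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q) {
        for (std::uint16_t corner : pattern) {
            indices.push_back(static_cast<std::uint16_t>(q * 4 + corner));
        }
    }
    return indices;
}

// Index bytes read per quad: six 16-bit indices, i.e. 12 bytes,
// or 3 bytes for each of the four vertex shader invocations.
std::size_t indexBytesPerQuad() {
    return 6 * sizeof(std::uint16_t);
}
```

With this layout, and because SV_VertexID carries the index value in an indexed draw, a shader could recover the quad as `vertexId / 4` and the corner as `vertexId % 4`. The buffer is static, so it only needs to be built once and can be shared by every draw of this kind.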

Instancing is another option, whereby each quad is its own instance. The index buffer then only needs to contain {0,1,2,2,1,3} and this will quickly be cached by the GPU and not get repeatedly read from DRAM for each instance - thus it generates essentially zero extra bandwidth. What you need to be careful about is that not every GPU from the last N years is capable of packing vertices from different instances into the same GPU wavefront. If a 16/32/64 thread wavefront is only running 4 vertices then you're going to be giving away considerable performance. What I can't remember off the top of my head is what hardware suffers from this problem, so it's worth testing for yourself.
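
To put a number on the wavefront concern, here is a small C++ sketch of lane utilization under the pessimistic assumption that a wavefront never mixes vertices from different instances. `laneUtilization` is a hypothetical helper for this illustration; whether a given GPU actually behaves this way is exactly what would need testing.

```cpp
#include <cassert>

// Fraction of wavefront lanes doing useful work when each instance's
// vertices are launched in wavefronts of their own (assumed worst case).
double laneUtilization(unsigned verticesPerInstance, unsigned waveSize) {
    assert(verticesPerInstance > 0 && waveSize > 0);
    // Number of wavefronts needed for one instance, rounded up.
    const unsigned waves = (verticesPerInstance + waveSize - 1) / waveSize;
    return static_cast<double>(verticesPerInstance) /
           static_cast<double>(waves * waveSize);
}
```

For a four-vertex quad instance this gives 25% utilization at 16 lanes, 12.5% at 32 lanes and 6.25% at 64 lanes, which is why packing across instances matters so much for tiny instances.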

You also mention Quad-Lists, which are not available in DX11 or DX12. Last time I checked there was only one IHV with a DX11 extension to support this. You didn't explicitly say you were using DirectX, but you're using D3D terminology such as 'SRV', so I assumed you are.

For your stated use cases (particles, decals, UIs, Sprite systems) you're unlikely to see much benefit from this sort of optimisation anyway. Particles are best drawn using more circular looking primitives such as octagons as it's much more beneficial to reduce your pixel count by 22.22% than it is counter-productive to increase your vertex count from 4 to 8. If you ever find yourself drawing geometry where the number of vertices is close to (or exceeding) the number of pixels then the solution is to reconsider why your triangles are only 1-2 pixels in size.
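
The post does not say how the octagon is constructed. One construction that reproduces the 22.22% figure exactly, offered here only as an assumption, is to cut each corner of the quad one third of the way along both adjacent edges: four triangles of area 1/18 are removed from a unit square, leaving 7/9. The C++ sketch below checks this with the shoelace formula; all names in it are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

struct Point { double x, y; };

// Polygon area via the shoelace formula (vertices in order around the polygon).
template <std::size_t N>
double polygonArea(const std::array<Point, N>& p) {
    double twiceArea = 0.0;
    for (std::size_t i = 0; i < N; ++i) {
        const Point& a = p[i];
        const Point& b = p[(i + 1) % N];
        twiceArea += a.x * b.y - b.x * a.y;
    }
    return std::fabs(twiceArea) * 0.5;
}

// Octagon made from the unit square by cutting each corner at distance
// `cut` along both adjacent edges (0 <= cut <= 0.5).
std::array<Point, 8> cutCornerOctagon(double cut) {
    assert(cut >= 0.0 && cut <= 0.5);
    return {{{cut, 0.0}, {1.0 - cut, 0.0}, {1.0, cut}, {1.0, 1.0 - cut},
             {1.0 - cut, 1.0}, {cut, 1.0}, {0.0, 1.0 - cut}, {0.0, cut}}};
}

// Fraction of the quad's area (and so roughly of its pixels) that is saved.
double pixelSaving(double cut) {
    return 1.0 - polygonArea(cutCornerOctagon(cut));
}
```

`pixelSaving(1.0 / 3.0)` evaluates to 2/9, about 22.22%. For comparison, a regular octagon (cut of 1/(2+√2)) saves roughly 17.2%, so the exact saving depends on the shape chosen.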

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

Starting from page 15. 

In reply to Adam Miles's post of 5/7/2019, 11:19 PM (quoted in full above):

Good afternoon Adam. In answer to your first paragraph, I'll explain the situation, which is admittedly slightly odd. I'm carrying out a mini project at home to investigate a problem I'm seeing at work: being vertex shader bound in some parts of our frame. We don't have time to play around with it at work (we have bigger fish to fry), but it has popped up in a few places and has piqued my interest (kind of like the mini projects you guys do in ATG).

At home I do not have the project and assets, and I'm using Microsoft's PIX for Windows, which as you know doesn't have all the swanky deep-level counters that Microsoft's Xbox PIX does (or Sony's Razor, for that matter). Nvidia gives you some high-level counters from which you have to more or less guess what kind of bottleneck you're facing, and there are no docs for the GPU unit acronyms they use. This is as much an experiment to see, if I did 'X', what types of problems pop up and what new bottlenecks I press up against. With your deep low-level knowledge I actually hoped you would answer me here and shed a bit of light on some of my thoughts and the directions I was thinking of taking.

I'm going to muddy the bandwidth problem in a moment (spoiler: async compute), but first I want to expand on bandwidth vs vertex throughput to make sure I've understood you correctly. Essentially you're saying a modern GPU at peak vertex throughput would chew through 24-96GB per second just for the index data, assuming the B in "2-8B vertices per second" stands for billion. That to me is still quite a bit of bandwidth on top of the actual vertex data and any other texture reads/writes going on at the same time. Peak performance figures over a second are also a bit misleading if you've got a specific problem at a specific point in the frame.

I have to say I'm not 100% sure that bandwidth is actually our issue at work, rather than the problem residing in the post-IA vertex cache or the post-VS vertex cache. What I'm focusing on here is mostly due to the AMD presentation (posted here, which I read quite some time ago): the aim is to remove those caching bottlenecks from the equation and focus on the bandwidth issue.

Thanks for pointing out that D3D12 doesn't support quads; I had just assumed it did! Is there not an argument that these should be supported, given the potential index bandwidth reduction? I'm sure the driver could take some kind of shortcut in this instance. Quads are used by a lot of artists as a basic primitive type in their art packages, and there are quite a few applications where quads (or multiple quads for cuboids) are a more convenient primitive. Anyway, it rules out one avenue of investigation, so that helps!

From what I've seen at work it looks like we're bandwidth bound, as we're running async GI calculations at the same time as our occlusion queries/decals (both rendering a lot of boxes). Whether a 22.22% reduction in pixel work is a win over a 400%(?) increase in vertex work, I'm more on the fence about, as we can have (on certain hardware at least) a maximum of 2 vertex wavefronts per SIMD unit vs 10 pixel wavefronts per SIMD unit; typically we only see 1 vertex wavefront vs 6 or 7 pixel wavefronts. This really does depend on the particle system, though: few big particles vs many small particles.
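
A quick back-of-the-envelope check of these figures, written as a C++ sketch: index traffic in GB/s is simply the vertex rate in billions per second multiplied by the index bytes attributed to each vertex invocation. `indexBandwidthGBps` is a hypothetical helper, and the inputs are the peak numbers quoted in the thread, not measurements.

```cpp
#include <cassert>

// Index traffic in GB/s for a given vertex rate (in billions of vertex
// invocations per second) and index bytes per vertex invocation.
// 1e9 vertices/s * N bytes = N GB/s, so the product is already in GB/s.
double indexBandwidthGBps(double billionVerticesPerSec, double indexBytesPerVertex) {
    assert(billionVerticesPerSec >= 0.0 && indexBytesPerVertex >= 0.0);
    return billionVerticesPerSec * indexBytesPerVertex;
}
```

At 3 bytes per vertex invocation (12 bytes per quad spread over four invocations), 2-8 billion vertices per second works out to 6-24 GB/s. The 24-96 GB/s range corresponds to charging the full 12 bytes to every vertex rather than to every quad, so it is worth checking which of the two is meant before drawing conclusions.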

Anyway thank you for your detailed reply. Many, many thanks!

This topic is closed to new replies.
