WriteBufferImmediate use-cases

Author

545

December 05, 2017 03:34 PM

I am writing a DX12 graphics wrapper, and I would like to update constant buffers. I found the ID3D12GraphicsCommandList2::WriteBufferImmediate method, which is apparently only available from the Windows 10 Creators Update onward. I couldn't really find any info about it (and couldn't try it yet). Am I correct to assume this would be useful for writing to constant buffers without much need for synchronization? It seems to me that this method copies the data into the command list itself, and that data is then copied to the DEFAULT resource address which I provided. Would the only synchronization needed here be transition barriers to COPY_DEST before WriteBufferImmediate() and back to GENERIC_READ afterwards? I could be totally off though; I'm still wrapping my head around a lot of things.

What other use cases would this method allow for?

Wicked Engine

SoldierOfLight

2,378

December 05, 2017 05:29 PM

At a high level, no, that is not its intended use. Using MODE_DEFAULT would (probably) cause the graphics pipeline to stall/drain every time you issue one of these writes, which would kill performance. Using either of the other modes could cause the writes to happen too soon (affecting draws already in flight) or too late (after all previous draws in flight are fully finished, not necessarily in time for the next one).

Its intended use is for checking progress of GPU execution, specifically when the GPU has faulted and the device has become removed. If you use WriteBufferImmediate to insert "breadcrumbs" at the top of pipe and bottom of pipe, and the GPU faults, you can inspect these breadcrumbs to see which workloads had started but not finished - i.e. which workloads could have possibly contributed to the fault.
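A minimal sketch of how the breadcrumb readback described above could be interpreted, written as plain CPU-side C++ with no D3D12 calls. It assumes workloads are numbered from 1 in submission order, that each workload's number is written to one slot at top of pipe (MARKER_IN) and to another slot at bottom of pipe (MARKER_OUT), and that the slots can still be read after the device is removed. The `Breadcrumbs` struct and `inFlightWorkloads` function are hypothetical names, not part of any API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for the two 32-bit slots the GPU would fill via
// WriteBufferImmediate (hypothetical layout, not a D3D12 type).
struct Breadcrumbs {
    std::uint32_t lastStarted  = 0; // assumed written with MARKER_IN before workload N
    std::uint32_t lastFinished = 0; // assumed written with MARKER_OUT after workload N
};

// Returns the IDs of workloads that had started but not finished,
// i.e. the ones that could have contributed to a fault.
// Assumes IDs are 1-based and increase in submission order.
inline std::vector<std::uint32_t> inFlightWorkloads(const Breadcrumbs& b) {
    std::vector<std::uint32_t> suspects;
    for (std::uint32_t id = b.lastFinished; id < b.lastStarted; ++id) {
        suspects.push_back(id + 1);
    }
    return suspects;
}
```

For example, if the readback shows `lastStarted == 7` and `lastFinished == 4`, the suspects are workloads 5, 6 and 7.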

billkris

110

December 05, 2017 07:00 PM

SoldierOfLight is correct that one of the main purposes of WriteBufferImmediate is to provide synchronized writes with work done in the pipeline using MARKER_IN and MARKER_OUT modes. However, MODE_DEFAULT is not a synchronizing operation. In fact, the purpose of MODE_DEFAULT is to enable quick, stochastic writes to buffer locations, such as updating a few constants. This also eliminates the need for an upload-heap staging buffer for these cases.

The buffer must be in either the COPY_DEST or COMMON state (will be promoted to COPY_DEST).

We would love to hear your feedback on how this affects your application performance. Also, let me know if you have any more questions.
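WriteBufferImmediate takes an array of (destination GPU address, 32-bit value) pairs, so "updating a few constants" means splitting the data into 32-bit writes. The sketch below shows only that splitting step on the CPU side. `ImmediateWrite` is a stand-in that mirrors the shape of D3D12_WRITEBUFFERIMMEDIATE_PARAMETER, and `buildImmediateWrites` is a hypothetical helper; in real code the resulting array would be passed to WriteBufferImmediate together with a matching array of modes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in mirroring D3D12_WRITEBUFFERIMMEDIATE_PARAMETER:
// one destination GPU virtual address plus one 32-bit value.
struct ImmediateWrite {
    std::uint64_t dest;
    std::uint32_t value;
};

// Splits `size` bytes at `data` into 32-bit writes starting at `destBase`.
// Both destBase and size must be multiples of 4; otherwise an empty
// list is returned (a choice made for this sketch).
inline std::vector<ImmediateWrite> buildImmediateWrites(std::uint64_t destBase,
                                                        const void* data,
                                                        std::size_t size) {
    std::vector<ImmediateWrite> writes;
    if (destBase % 4 != 0 || size % 4 != 0) {
        return writes;
    }
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    writes.reserve(size / 4);
    for (std::size_t offset = 0; offset < size; offset += 4) {
        std::uint32_t value = 0;
        std::memcpy(&value, bytes + offset, sizeof(value)); // avoids aliasing issues
        ImmediateWrite w;
        w.dest = destBase + offset;
        w.value = value;
        writes.push_back(w);
    }
    return writes;
}
```

Because every 4 bytes becomes its own entry in the command list, this only looks attractive for small updates; larger ones are probably better served by a staging copy.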

SoldierOfLight

2,378

December 05, 2017 07:19 PM

Right, to clarify, the barrier to COPY_DEST would cause the write to be serialized with the previous read operation. However, if you weren't previously reading from the resource in the command list, then yes, WriteBufferImmediate is an excellent replacement for CopyBufferRegion.

turanszkij

Author

545

December 07, 2017 04:42 PM

On 05/12/2017 at 7:19 PM, SoldierOfLight said:
Right, to clarify, the barrier to COPY_DEST would cause the write to be serialized with the previous read operation. However, if you weren't previously reading from the resource in the command list, then yes, WriteBufferImmediate is an excellent replacement for CopyBufferRegion.

Right now I am using an upload heap allocator: the CPU writes into the upload heap, and then I issue a CopyBufferRegion into a default heap resource. Each buffer has its own default heap resource, and the upload heap is a global heap shared by all buffers. This way, on each update I will have:

allocate next chunk from upload heap
memcpy into upload heap
transition barrier from constant buffer to copy_dest
CopyBufferRegion(default_heap, 0, upload_heap, upload_heap_offset, dataSize)
transition back to constant buffer
bind constant buffer to pixel shader

Do you think this would be an acceptable/standard way of doing this? I could not test perf yet, I'm just setting everything up. The data seems correct in the debugger.

The WriteBufferImmediate version would be nearly the same, except that I would copy my constant buffer data into the command list.

Wicked Engine

SoldierOfLight

2,378

December 07, 2017 04:46 PM

What you've implemented is the D3D11 equivalent of UpdateSubresource on a default constant buffer. WriteBufferImmediate would be roughly the same thing. In my experience, most people prefer to implement the D3D11 equivalent of Map(DISCARD) on a dynamic constant buffer, which would mean just binding your upload heap directly to the pixel shader.

MJP

20,297

December 07, 2017 11:33 PM

For my persistent "dynamic" buffers I like to have a "CPUWritable" flag that lets you have two different behaviors. If that flag is set, the buffer is allocated out of an UPLOAD heap and can be written to directly by the CPU. To make sure that the CPU doesn't overwrite something that the GPU is reading, the buffer is internally double-buffered, and the buffers are swapped when the contents are changed by the CPU. With this setup you can only flip the buffer at most once per frame (where a "frame" is denoted by a fenced submission of multiple command lists to the DIRECT queue, followed by a Present), so I have an assert to track which frame the buffer was last updated.
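A rough CPU-side model of the double-buffered, CPU-writable buffer described above. This is not MJP's actual code: the class name and its methods are hypothetical, plain vectors stand in for persistently mapped UPLOAD memory, and a second update in the same frame is rejected with a `false` return instead of an assert so the behavior is easy to check. It also assumes the CPU waits on a frame fence before a slot the GPU used is written again.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Model of a double-buffered constant buffer. In a real renderer each
// slot would be a mapped range of an UPLOAD-heap resource.
class DoubleBufferedConstants {
public:
    explicit DoubleBufferedConstants(std::size_t size) {
        slots_[0].resize(size);
        slots_[1].resize(size);
    }

    // Writes into the slot the GPU is not reading, then makes it current.
    // Returns false (and changes nothing) on a second update within the
    // same frame or if the data does not fit.
    bool update(std::uint64_t frameIndex, const void* data, std::size_t size) {
        if (hasUpdated_ && frameIndex == lastUpdateFrame_) return false;
        if (size == 0 || size > slots_[0].size()) return false;
        current_ ^= 1u; // flip to the other slot
        std::memcpy(slots_[current_].data(), data, size);
        lastUpdateFrame_ = frameIndex;
        hasUpdated_ = true;
        return true;
    }

    // Index of the slot that draws recorded this frame should bind.
    std::size_t currentSlot() const { return current_; }
    const unsigned char* currentData() const { return slots_[current_].data(); }

private:
    std::vector<unsigned char> slots_[2];
    std::size_t current_ = 0;
    std::uint64_t lastUpdateFrame_ = 0;
    bool hasUpdated_ = false;
};
```

The once-per-frame rule is what keeps two slots sufficient: a slot is only rewritten two flips after it was bound, by which point the fence wait assumed above has confirmed the GPU is done with it.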

If the CPUWritable flag is false, then the contents have to be updated by writing to temporary UPLOAD memory first, and then copying that to the actual buffer memory in a DEFAULT heap. However I do it a little differently than you're proposing, since I use a COPY queue to do the copy instead of using a DIRECT queue. Doing it on the copy queue is trickier since you have multi-queue synchronization involved, but the upside is that the copy can potentially start earlier and run alongside other graphics work (which you usually want to do for initializing static resources). To again avoid writing something that the GPU is reading from, I also double-buffer in this case and only allow at most 1 update per frame. For the temporary memory from an UPLOAD heap that's used as a staging area, I have a ring buffer that tracks fences to know when it can move the start pointer forward.
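The fence-tracking ring buffer mentioned above could look roughly like the following. It is a sketch under stated assumptions, not the code from the post: `UploadRing` and its methods are hypothetical names, offsets index into an UPLOAD-heap staging buffer that is not modeled here, and fence values are assumed to be non-decreasing in submission order.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Ring allocator over a staging buffer of `capacity` bytes. Each
// allocation is tagged with the fence value that will be signaled once
// the GPU has consumed it.
class UploadRing {
public:
    explicit UploadRing(std::uint64_t capacity) : capacity_(capacity) {}

    // Reserves `size` contiguous bytes. On success stores the offset in
    // `offsetOut`; returns false if the GPU has not freed enough space.
    bool allocate(std::uint64_t size, std::uint64_t fenceValue, std::uint64_t& offsetOut) {
        if (size == 0 || size > capacity_) return false;
        std::uint64_t padding = 0;
        std::uint64_t offset = head_;
        if (head_ + size > capacity_) { // does not fit before the end: wrap
            padding = capacity_ - head_;
            offset = 0;
        }
        if (used_ + padding + size > capacity_) return false;
        used_ += padding + size;
        head_ = (offset + size) % capacity_;
        Pending p;
        p.fence = fenceValue;
        p.bytes = padding + size;
        pending_.push_back(p);
        offsetOut = offset;
        return true;
    }

    // Moves the start of the live region forward past every allocation
    // whose fence the GPU has reached.
    void retire(std::uint64_t completedFence) {
        while (!pending_.empty() && pending_.front().fence <= completedFence) {
            used_ -= pending_.front().bytes;
            pending_.pop_front();
        }
    }

private:
    struct Pending {
        std::uint64_t fence;
        std::uint64_t bytes; // allocation size plus any wrap padding
    };
    std::uint64_t capacity_;
    std::uint64_t head_ = 0; // next write offset
    std::uint64_t used_ = 0; // bytes still owned by in-flight copies
    std::deque<Pending> pending_;
};
```

A caller would typically call `retire` with the queue's last completed fence value before each `allocate`, and block on the fence only when `allocate` still fails.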

With your approach of doing the copy on the DIRECT queue, the nice part would be that it will be synchronized with the graphics work on the GPU timeline. This means that you don't need to double-buffer, or do any synchronization beyond your barriers. But the downside is that the copy will happen synchronously with your graphics work, instead of "hiding" in other work. You'll also have to track your fence on the DIRECT queue to know when to free your chunk from the UPLOAD heap.

For choosing between whether to keep your buffer in UPLOAD memory or copy into DEFAULT memory, the best choice most likely depends on how you access the data. If the data is small and you're not going to do repeated random accesses to it, UPLOAD is probably fine (this covers a lot of constant buffers). If the data is larger and you access it multiple times, then it's probably worth copying it to DEFAULT so that you get full access speeds on the GPU (something like a StructuredBuffer full of lights for a forward+ renderer would probably fall into this category).

Anyway, I just wanted to share what I'm doing to give you a few ideas. I'm not claiming to have the best possible approaches here, so feel free to do what works best for you and your engine.

EDIT: I forgot to add some links to my code for reference. You can find the buffer code here, and the upload queue code here. Just be aware that the descriptor management is a bit complicated since that code uses persistent bindless descriptor indices, so there's some jumping through hoops to make sure that the descriptor index doesn't have to change when the buffer is updated.

The Blog | The Book

turanszkij

Author

545

December 08, 2017 11:46 AM

Thanks for this, great information.

12 hours ago, MJP said:
You'll also have to track your fence on the DIRECT queue to know when to free your chunk from the UPLOAD heap.

I've been thinking about just leaving the fence and freeing the UPLOAD heaps on frame start. I have unique upload heaps per frame for double (or triple) buffering. I will have a fence only when there are no more frames available to be queued up which the GPU hasn't finished yet.
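One way to picture the per-frame upload heaps is a small pool of linear allocators that is indexed by frame number and reset at frame start. The sketch below is CPU-side bookkeeping only; the type names are hypothetical, and it assumes the caller has already waited on the fence for the frame that last used the slot being reset. The 256-byte step reflects the D3D12 placement alignment for constant buffer data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// D3D12 requires constant buffer data to start on 256-byte boundaries.
constexpr std::uint64_t kConstantBufferAlignment = 256;

constexpr std::uint64_t alignUp(std::uint64_t value, std::uint64_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
}

// Bump allocator over one frame's upload heap (offsets only).
struct FrameUploadAllocator {
    std::uint64_t capacity = 0;
    std::uint64_t offset = 0;

    // Returns false when this frame's heap is exhausted.
    bool allocate(std::uint64_t size, std::uint64_t& outOffset) {
        const std::uint64_t aligned = alignUp(offset, kConstantBufferAlignment);
        if (size == 0 || aligned + size > capacity) return false;
        outOffset = aligned;
        offset = aligned + size;
        return true;
    }

    void reset() { offset = 0; }
};

// One allocator per buffered frame (2 for double, 3 for triple buffering).
class FrameUploadPool {
public:
    FrameUploadPool(std::size_t frameCount, std::uint64_t capacityPerFrame)
        : frames_(frameCount == 0 ? 1 : frameCount) {
        for (FrameUploadAllocator& f : frames_) f.capacity = capacityPerFrame;
    }

    // Call at frame start, after the GPU has finished the frame that
    // previously used this slot (fence wait not shown).
    FrameUploadAllocator& beginFrame(std::uint64_t frameIndex) {
        FrameUploadAllocator& f = frames_[frameIndex % frames_.size()];
        f.reset();
        return f;
    }

private:
    std::vector<FrameUploadAllocator> frames_;
};
```

With this layout no per-allocation fence is needed: the only wait happens when the CPU gets a full `frameCount` frames ahead of the GPU.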

Also very interesting way of using the copy queue, I will not get into that yet but seems like an interesting technique. I heard that the copy queue could be slower, but it would use different hardware units so utilization could be better. Could you compare this with different hardware vendors as well?

This also cleared up some confusion, thanks for this:

18 hours ago, SoldierOfLight said:
What you've implemented is the D3D11 equivalent of UpdateSubresource on a default constant buffer. WriteBufferImmediate would be roughly the same thing. In my experience, most people prefer to implement the D3D11 equivalent of Map(DISCARD) on a dynamic constant buffer, which would mean just binding your upload heap directly to the pixel shader.

I just thought that you can't bind an upload heap as shader resource. I guess this way you are creating constant buffer views for each allocation from the heap for binding to the descriptor tables? Does an UPLOAD heap which is shader visible need unmapping or can it also stay mapped forever?

Wicked Engine

SoldierOfLight

2,378

December 08, 2017 03:36 PM

3 hours ago, turanszkij said:
I guess this way you are creating constant buffer views for each allocation from the heap for binding to the descriptor tables? Does an UPLOAD heap which is shader visible need unmapping or can it also stay mapped forever?

Yep, that's right, and no it doesn't need to be unmapped.

MJP

20,297

December 08, 2017 11:43 PM

11 hours ago, turanszkij said:
Also very interesting way of using the copy queue, I will not get into that yet but seems like an interesting technique. I heard that the copy queue could be slower, but it would use different hardware units so utilization could be better. Could you compare this with different hardware vendors as well?

I don't have any comprehensive numbers at the moment, so I'll have to try to set up a benchmark at some point. I would guess that the difference would be pretty minimal unless you're uploading a very large buffer. For me it was also somewhat convenient to use the COPY queue since I already had a system in place for initializing resources using the COPY queue, and the buffer updates go through the same system. The IHVs have recommended using the COPY queue for resource initialization, since the DMA units are optimized for pulling lots of data over the PCI-e bus without disrupting rendering too much (which is necessary in D3D11 games that stream in new textures while gameplay is going on).

The Blog | The Book
