
Direct3D 12 Staging Resources

Started by Jason Z, May 06, 2014 06:20 PM
3 comments, last by maxmcmullen 10 years, 6 months ago

Over the years since Direct3D 11 was released, I would venture to guess that one of the most common questions / problems raised by new developers has been the requirement for a staging resource when you want both CPU and GPU access to a resource. The requirement is counter-intuitive, and if your resources are really big (e.g. 3D textures) it is even more of a problem, since you have to keep a huge second resource around just for copying data back and forth. The alternative is to temporarily create a resource just for the transfer and then release it, but that goes against recommended practice.
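For reference, here is roughly what that looks like in D3D11 today: reading back a GPU-only texture means creating and copying through a separate staging resource. This is only a minimal sketch with error handling omitted; the device, context, and source texture are assumed to already exist:

```cpp
#include <d3d11.h>

// Read back a GPU-only (default usage) texture by round-tripping through a
// second, CPU-readable staging copy of the same resource.
void ReadBackTexture(ID3D11Device* device, ID3D11DeviceContext* context,
                     ID3D11Texture2D* gpuTexture)
{
    D3D11_TEXTURE2D_DESC desc = {};
    gpuTexture->GetDesc(&desc);                // start from the GPU resource's description
    desc.Usage          = D3D11_USAGE_STAGING; // CPU-readable, not bindable to the pipeline
    desc.BindFlags      = 0;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
    desc.MiscFlags      = 0;

    ID3D11Texture2D* staging = nullptr;
    device->CreateTexture2D(&desc, nullptr, &staging);

    context->CopyResource(staging, gpuTexture); // GPU copies into the staging resource

    D3D11_MAPPED_SUBRESOURCE mapped = {};
    context->Map(staging, 0, D3D11_MAP_READ, 0, &mapped); // blocks until the copy finishes
    // ... read mapped.pData, honouring mapped.RowPitch ...
    context->Unmap(staging, 0);
    staging->Release();
}
```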

So this post is an open request to the Direct3D 12 developers (at least one of whom I know is lurking around... Max!). Please allow the staging properties of a resource to be controlled with the resource barrier objects. If we could change the CPU / GPU access properties with a barrier transition, it would give developers easier control over their resources and reduce the number of API calls needed to copy data back to the CPU. That should theoretically improve performance (fewer API calls), and it lets the algorithm implementer state explicitly what they are trying to achieve.
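Purely to illustrate the request, here is a sketch of the intended usage pattern. Every type and name below is invented for illustration; nothing like this exists in any announced D3D12 API:

```cpp
// Hypothetical sketch only -- these types do not exist. The idea: retask an
// existing GPU resource for CPU access via a barrier transition, instead of
// creating a second staging resource and copying through it.
enum class HypotheticalAccess { GpuOnly, CpuRead };

struct HypotheticalStagingBarrier
{
    void*              resource;   // e.g. a large 3D texture that is normally GPU-only
    HypotheticalAccess before;     // HypotheticalAccess::GpuOnly
    HypotheticalAccess after;      // HypotheticalAccess::CpuRead
};

inline HypotheticalStagingBarrier MakeReadbackTransition(void* volumeTexture)
{
    // The app would record this barrier on a command list; the driver would do
    // whatever copy or migration it needs internally, and the app would Map()
    // the same resource once the GPU work is fenced as complete.
    return { volumeTexture, HypotheticalAccess::GpuOnly, HypotheticalAccess::CpuRead };
}
```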

This functionality may already be possible in the current state of the API (I haven't seen any more than the BUILD 2014 talk) but if it isn't, please consider adding this!

If there are other topics like this that the general community sees as relevant or important to change for D3D12, please post those ideas so that the feedback gets to the right people!

D3D:
- No more cap-bits (at least, no more new cap-bits); nobody needs another extension/capabilities hell...
- Unified virtual memory between CPU and GPU (AMD and NVIDIA are working on that with HSA/hUMA and CUDA 6 respectively; not sure about Intel...), which should also resolve what Jason Z is asking... EDIT: Intel has a similar thing called "Direct Resource Access"...
- Depth bounds test; AMD and NVIDIA already have their own extensions for it...
- Adaptive Order-Independent Transparency / Order-Independent Transparency Approximation with Pixel Synchronization... Intel did it...

DXGI:
- 10-bit output in windowed mode (not only in full-screen); 10-bit IPS monitors are becoming common among gamers too (well, fake 10-bit; they are mostly "true 8-bit with A-FRC"...)
- Truly working v-sync in windowed mode

Tools:
- Better HLSL support, with auto-complete and code analysis
"Recursion is the first step towards madness." - "Skegg?ld, Skálm?ld, Skildir ro Klofnir!"
Direct3D 12 quick reference: https://github.com/alessiot89/D3D12QuickRef/

AFAIK, D3D12 is going to support hardware all the way back to feature level 9.3, so this would have to work across old and new architectures.

Newer devices can just put the resource into a memory location that's mapped in the address space of your CPU process and in the GPU's address space. No staging/copies required (just manual fencing by the application to avoid race conditions).

Older devices will still require the creation of a CPU-accessible staging resource, which the GPU can transfer the data into.

So - I guess you'd want this staging resource to be a driver-internal detail, rather than an application detail? Instead, at creation time we tell the driver that we want both CPU & GPU access (which lets the driver create a GPU-local and a CPU-staging allocation if required), and then at runtime we issue these barriers to tell the driver when to transition from CPU ownership to GPU ownership and back. On newer devices this could be a no-op, but on other devices it would perform the copying between staging resources as required?

[edit] The alternative is what Mantle is doing, described by the DICE guys here: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Rendering-Battlefield-4-with-Mantle-Johan-Andersson.ppsx

There, the application can choose where resources will be placed, and would have to implement 'staging' resources itself (and the required copies) if no GPU+CPU-mapped heap is reported as available by the driver, or if it decides there's no heap with high enough GPU-write and CPU-read performance.
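A rough sketch of what "the application implements staging itself" looks like in an explicit-heap API. The names below are taken from the D3D12 API as it later shipped (readback heap plus an explicit copy), used here purely for illustration; error handling and fencing are omitted:

```cpp
#include <d3d12.h>

// Application-managed readback: the app picks the heap type itself and issues
// the copy into it, then maps the buffer on the CPU after fencing.
void CreateReadbackBuffer(ID3D12Device* device, UINT64 sizeInBytes,
                          ID3D12Resource** outBuffer)
{
    D3D12_HEAP_PROPERTIES heapProps = {};
    heapProps.Type = D3D12_HEAP_TYPE_READBACK;   // CPU-readable, system-memory heap

    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension        = D3D12_RESOURCE_DIMENSION_BUFFER;
    desc.Width            = sizeInBytes;
    desc.Height           = 1;
    desc.DepthOrArraySize = 1;
    desc.MipLevels        = 1;
    desc.Format           = DXGI_FORMAT_UNKNOWN;
    desc.SampleDesc.Count = 1;
    desc.Layout           = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;

    device->CreateCommittedResource(&heapProps, D3D12_HEAP_FLAG_NONE, &desc,
                                    D3D12_RESOURCE_STATE_COPY_DEST, nullptr,
                                    IID_PPV_ARGS(outBuffer));
    // The command list then records CopyBufferRegion/CopyTextureRegion into this
    // buffer, the app waits on a fence, and finally Map()s it to read the data.
}
```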

I think I'm missing something.


Newer devices can just put the resource into a memory location that's mapped in the address space of your CPU process and in the GPU's address space
OK, but the performance level isn't going to be the same. If I understand correctly what you're writing, this would solve the issue of which component takes care of synchronization and usage, but I'm having difficulty understanding the performance pattern involved.

I mean, suppose I have this GPU-only resource and I make it staging. Even on a device that allows mapping of device memory, isn't the access performance going to be very different?

I certainly understand the problem, but the solution isn't clicking in my head.

Also, why does a resource need to be staging in the first place? OpenCL allows host buffers to be bound as kernel outputs, which is a smart thing to do if you're building incremental queues where, say, only 0.001% of work items produce a result that lands there.

Again, this methodology has its own set of issues regarding bandwidth and latency, but it sounds good to me. There's no explicit concept of staging in OpenCL, just the flags and the location (like Mantle, as far as I understand).
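For reference, a rough sketch of the OpenCL pattern being described: a host-accessible buffer bound directly as a kernel output, then mapped on the host. The `context`, `queue`, and `kernel` parameters are assumed to already exist, and the kernel is assumed to write one `cl_uint` per work item:

```cpp
#include <CL/cl.h>

// Host-visible buffer used directly as a kernel output, then mapped for reading.
// No separate "staging" object: the flags on the allocation decide where it lives.
void ReadKernelOutput(cl_context context, cl_command_queue queue,
                      cl_kernel kernel, size_t sizeInBytes)
{
    cl_int err = CL_SUCCESS;
    cl_mem output = clCreateBuffer(context,
                                   CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, // host-accessible allocation
                                   sizeInBytes, nullptr, &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &output);   // bind as the kernel's output argument

    size_t globalSize = sizeInBytes / sizeof(cl_uint);    // one cl_uint per work item (assumed)
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &globalSize, nullptr,
                           0, nullptr, nullptr);

    // Map the same buffer on the host; the blocking map doubles as the sync point.
    void* ptr = clEnqueueMapBuffer(queue, output, CL_TRUE, CL_MAP_READ,
                                   0, sizeInBytes, 0, nullptr, nullptr, &err);
    // ... read results through ptr ...
    clEnqueueUnmapMemObject(queue, output, ptr, 0, nullptr, nullptr);
    clReleaseMemObject(output);
}
```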

Previously "Krohm"

To Jason's initial post:

Enough of the high-level details have already been revealed to respond to your post, but it's quite a jump from there to the API details you probably want to hear. D3D 12 doesn't have strongly typed memory allocations like D3D 11, which strictly limited the dimensionality and usage of memory at creation time. On 12, the main memory allocation parameters are CPU access/cacheability and GPU locality vs. CPU locality. Some examples:

Dynamic vertex buffers in 11 would be an application-managed ring buffer of memory in 12, allocated with write-combined CPU cacheability and CPU locality.

11-style default 2D textures do not have CPU access and have GPU locality. 12 will also expose the ability to map multidimensional GPU-local resources, useful for example for reading out the results of a reduction operation with low latency. In this case it would be write-combined CPU access with GPU locality. In the GDC unveil of D3D 12 this was briefly mentioned on a slide, called "map default" or "swizzled texture access" IIRC.

Cacheability and locality will not be mutable properties of memory allocations, but 12 will allow memory with a given set of properties to be retasked for multiple resource types (1D/2D/3D, VB/texture/UAV/..., width/height/depth, etc.). More details later this year...
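To make the dynamic-vertex-buffer example above concrete, here is a minimal sketch of an application-managed ring buffer over a persistently mapped, write-combined allocation. The class and the fence comment are illustrative assumptions, not anything from the announced API:

```cpp
#include <cstdint>
#include <cstring>

// Minimal ring-buffer allocator over a persistently mapped, write-combined
// allocation. The wrap-around comment stands in for whatever GPU-progress
// fence the real API provides.
class DynamicRingBuffer
{
public:
    DynamicRingBuffer(uint8_t* mappedBase, size_t capacity)
        : m_base(mappedBase), m_capacity(capacity), m_head(0) {}

    // Copies 'size' bytes into the ring and returns the offset the GPU should read from.
    size_t Write(const void* data, size_t size)
    {
        if (m_head + size > m_capacity)
            m_head = 0;                       // wrap; real code must first verify via a fence
                                              // that the GPU has finished reading this region
        size_t offset = m_head;
        std::memcpy(m_base + offset, data, size); // sequential writes suit write-combined memory
        m_head += size;
        return offset;
    }

private:
    uint8_t* m_base;      // CPU pointer to the persistently mapped allocation
    size_t   m_capacity;
    size_t   m_head;      // next free byte
};
```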

D3D 12 will have multiple methods for moving data between CPU & GPU, each serving different scenarios/performance requirements. More details later this year... :)

To Alessio1989's reply:

I expect the feature level/cap evolution to remain the same. D3D will expose some new features as independent caps and simultaneously bake sets of common caps together into a new feature level to guarantee support and reduce the implementation/testing matrix for developers. It's the best of both worlds between D3D9 and D3D10+. 9 allowed fine-grained feature additions without forcing hardware vendors to perfectly align on feature set but created an unsupportable mess of combinations. 10 allowed developers to rely on a combination of features but tended to delay API support for hardware features until most GPU vendors had built or nearly built that combination in hardware. 11 & 12 have evolved to have caps for initial exposure with feature levels baking in a common set over time.
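For context, the D3D11 shape of that caps-plus-feature-levels mix looks roughly like this (a sketch only; the device is assumed to already exist):

```cpp
#include <d3d11.h>

// Feature level gives the guaranteed baseline; optional features are queried
// individually as caps on top of it.
void QueryCaps(ID3D11Device* device)
{
    D3D_FEATURE_LEVEL level = device->GetFeatureLevel();   // e.g. D3D_FEATURE_LEVEL_11_0

    // Optional feature exposed as a cap rather than a whole new feature level.
    D3D11_FEATURE_DATA_D3D11_OPTIONS options = {};
    device->CheckFeatureSupport(D3D11_FEATURE_D3D11_OPTIONS,
                                &options, sizeof(options));

    bool hasMapNoOverwriteOnDynamicCB = options.MapNoOverwriteOnDynamicConstantBuffer != 0;
    // ... branch the renderer on 'level' for baseline paths and on caps for extras ...
    (void)level; (void)hasMapNoOverwriteOnDynamicCB;
}
```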

Max McMullen

Direct3D Development Lead

Microsoft

