
[D3D12] Total brightness problem in compute shader

Started on January 13, 2016, 11:04 AM
13 comments, last by red75prime 8 years, 9 months ago

The code below is supposed to calculate total brightness of tsrc texture.

The good news is that the HD Graphics 4600 calculates it just fine. The bad news is that the GTX 980 does not.

The values I read from the read-back buffer fluctuate wildly, but they seem to stay below the correct value.

I took the code for atomic addition of float values from this thread: http://www.gamedev.net/topic/613648-dx11-interlockedadd-on-floats-in-pixel-shader-workaround/

I have no idea what's going on. Thanks in advance.

EDIT: 'globallycoherent' doesn't work. Using 'InterlockedAdd' and summing uints doesn't work either.


#define TotalGroups 32
#define RSDT "RootFlags(0), UAV(u0), DescriptorTable(SRV(t0))" // Descriptor table is required for texture

Texture2D<float4> tsrc: register(t0);
RWByteAddressBuffer total : register(u0);

groupshared float4 bpacked[TotalGroups*TotalGroups];

float brightness(float4 cl) {
  // TODO: replace with correct implementation
  return cl.r + cl.g + cl.b;
}

[RootSignature(RSDT)]
[numthreads(TotalGroups,TotalGroups,1)]
void CSTotal(uint3 gtid: SV_GroupThreadId, uint3 gid : SV_GroupId, uint gindex : SV_GroupIndex, uint3 dtid : SV_DispatchThreadID) {
  uint2 crd = (gid.xy * TotalGroups + gtid.xy)*2;
  float br[4];
  [unroll]
  for (uint x = 0; x < 2; ++x) {
    [unroll]
    for (uint y = 0; y < 2; ++y) {
      // Color outside of tsrc is guaranteed to be 0.
      br[y * 2 + x] = brightness(tsrc[crd+uint2(x,y)]);
    }
  }
  bpacked[gindex] = float4(br[0],br[1],br[2],br[3]);
  if (all(dtid == uint3(0, 0, 0))) {
    // set initial value of total brightness accumulator
    total.Store(0, asuint(0.0));
  }
  AllMemoryBarrierWithGroupSync();
  // bpacked array now contains brightness in each component of each value

  // reduce bpacked to single value
  [unroll]
  for (uint thres = TotalGroups*TotalGroups / 2; thres > 0; thres /= 2) {
    if (gindex < thres) {
      bpacked[gindex] += bpacked[gindex + thres];
    }
    AllMemoryBarrierWithGroupSync();
  }
  if (gindex == 0) {
    float4 cl = bpacked[0];
    float value = cl.r + cl.g + cl.b + cl.a;

    // First thread in thread group atomically adds calculated brightness to the accumulator
    uint comp, orig = total.Load(0);
    [allow_uav_condition] do
    {
      comp = orig;
      total.InterlockedCompareExchange(0, comp, asuint(asfloat(orig) + value), orig);
    } while (orig != comp);
  }
}

The invocation of the compute shader is written in Rust, but it should be sufficiently readable.

I'm sure the Rust bindings to D3D12 are not the cause of the problem; I've been working with them for months without issues.


        let src_desc = srv_tex2d_default_slice_mip(srcdesc.Format, 0, 1);
        core.dev.create_shader_resource_view(Some(&src), Some(&src_desc), res.total_dheap.cpu_handle(0));

        clist.set_pipeline_state(&self.total_cpso);
        clist.set_compute_root_signature(&self.total_rs);
        clist.set_descriptor_heaps(&[res.total_dheap.get()]);
        // Root parameter 1: descriptor table holding the tsrc SRV
        clist.set_compute_root_descriptor_table(1, res.total_dheap.gpu_handle(0));
        // Root parameter 0: the accumulator UAV, bound as a root descriptor
        clist.set_compute_root_unordered_access_view(0, res.rw_total.get_gpu_virtual_address());

        clist.resource_barrier(&[
          *ResourceBarrier::transition(&src,
            D3D12_RESOURCE_STATE_COMMON, D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE),
          *ResourceBarrier::transition(&res.rw_total,
            D3D12_RESOURCE_STATE_COMMON, D3D12_RESOURCE_STATE_UNORDERED_ACCESS),
        ]);

        // Each thread group covers a TOTAL_CHUNK_SIZE x TOTAL_CHUNK_SIZE pixel chunk
        clist.dispatch(cw / TOTAL_CHUNK_SIZE, ch / TOTAL_CHUNK_SIZE, 1);
        clist.resource_barrier(&[
          *ResourceBarrier::transition(&res.rw_total,
            D3D12_RESOURCE_STATE_UNORDERED_ACCESS, D3D12_RESOURCE_STATE_COPY_SOURCE),
        ]);
        clist.copy_resource(&res.rb_total, &res.rw_total);
        clist.resource_barrier(&[
          *ResourceBarrier::transition(&res.rw_total,
            D3D12_RESOURCE_STATE_COPY_SOURCE, D3D12_RESOURCE_STATE_COMMON),
        ]);
        try!(clist.close());

        core.compute_queue.execute_command_lists(&[clist]);

        wait_for_compute_queue(core, &res.fence, &create_event());

        let total_brightness = res.total_brightness();
        let avg_brightness = total_brightness / cw as f32 / ch as f32;
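
Here res.total_brightness() just maps the read-back buffer and reads the first four bytes as an f32, roughly like the sketch below. The map/unmap helpers stand in for the actual binding methods (they mirror ID3D12Resource::Map/Unmap); treat the names as assumptions.


        // Hypothetical sketch, not the actual implementation:
        fn total_brightness(&self) -> f32 {
            // Map the read-back buffer into CPU-visible memory
            let ptr = self.rb_total.map(0, None).expect("map failed") as *const f32;
            // The whole result is a single f32 at offset 0
            let value = unsafe { *ptr };
            self.rb_total.unmap(0, None);
            value
        }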


I knew it. This part just looked too ugly:


  if (all(dtid == uint3(0, 0, 0))) {
    // set initial value of total brightness accumulator
    total.Store(0, asuint(0.0));
  }

Nothing orders the group containing dispatch thread (0,0,0) relative to the other thread groups, so the Store can execute after other groups have already added their partial sums to the accumulator, wiping them out. When I replaced it with ClearUnorderedAccessViewUint... I got a program crash on the GTX 980. It seems this function is broken on NVIDIA.

Then I replaced it with


// 'total' is the same RWByteAddressBuffer bound at register u0 as above
[RootSignature(RSDT)]
[numthreads(1,1,1)]
void CSClearTotal() {
  total.Store(0, 0);
}

And now it works.
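
For context, here's roughly how the clear dispatch is recorded ahead of the reduction dispatch, in the style of the listing above. clear_cpso stands in for the PSO compiled from CSClearTotal, and ResourceBarrier::uav stands in for a wrapper around D3D12_RESOURCE_BARRIER_TYPE_UAV; neither name is necessarily what my bindings actually use.


        // Sketch: record the clear before the reduction pass
        // (rw_total is assumed to already be in the UNORDERED_ACCESS state here)
        clist.set_pipeline_state(&self.clear_cpso);       // PSO built from CSClearTotal
        clist.set_compute_root_signature(&self.total_rs); // same root signature, only u0 is used
        clist.set_compute_root_unordered_access_view(0, res.rw_total.get_gpu_virtual_address());
        clist.dispatch(1, 1, 1);
        // UAV barrier: the clear must complete before CSTotal starts accumulating
        clist.resource_barrier(&[*ResourceBarrier::uav(&res.rw_total)]);
        // ...then the CSTotal setup and dispatch from the earlier listing follow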

EDIT: Maybe ClearUnorderedAccessViewUint isn't broken. Maybe I don't understand what the second parameter means.


I took the code for atomic addition of float values from this thread: http://www.gamedev.net/topic/613648-dx11-interlockedadd-on-floats-in-pixel-shader-workaround/

Don't spin your GPU like this. Read this instead: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf

or search for "parallel reduction". There are actually faster ways to do it on an NVIDIA GPU that aren't exposed to DX12.

As a reference benchmark, they reduce twice the number of values in a 1920x1080 texture (about 4 million items) in 0.268 ms on something like 5-year-old hardware.


I implemented parallel reduction inside a thread group; atomic addition is performed only once per thread group. And performance is not that bad: around 4 ms for a 3840x2160 R32G32B32A32_FLOAT texture.

Vendor-specific optimizations can be safely postponed, I think.

You might be getting a debug error if ClearUnorderedAccessViewUint fails. I have a GTX 970 and it works fine. You need to have the buffer's UAV in a descriptor heap that is currently set, and the UAV also can't be in a shader-visible heap, IIRC. That's what got me at first.

I'm still highly skeptical about that atomic float addition. If 4 ms is good enough then great, but it seems like a pretty substantial amount of time to me.


You need to have the buffer's UAV in a descriptor heap that is currently set, and the UAV also can't be in a shader-visible heap, IIRC

Thank you. MSDN doesn't mention any of this, and the debug layer's message in case of an error is not clear at all.

I experimented a bit. The GPU descriptor handle can be in any descriptor heap (set or not, shader-visible or not); the debug layer doesn't complain in any case.

The CPU descriptor handle, however, must not point to a UAV in a shader-visible heap.
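
In call form it looks roughly like this, mirroring the C++ signature of ID3D12GraphicsCommandList::ClearUnorderedAccessViewUint. This is only a sketch: clear_unordered_access_view_uint, cpu_only_dheap, and the UAV slot indices are illustrative names, not my actual bindings.


        // Sketch: the two descriptor handles ClearUnorderedAccessViewUint wants
        clist.set_descriptor_heaps(&[res.total_dheap.get()]);
        clist.clear_unordered_access_view_uint(
            res.total_dheap.gpu_handle(1),    // GPU handle: UAV in the currently set, shader-visible heap
            res.cpu_only_dheap.cpu_handle(0), // CPU handle: same UAV in a non-shader-visible heap
            &res.rw_total,                    // resource being cleared
            &[0u32; 4],                       // clear values, one uint per component
            &[],                              // no rects: clear the whole buffer
        );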

EDIT: My bad. MSDN has a comment about this in the community additions section, but I use the offline docs.


D3D12 is really tricky. It took me two weeks to make the code work on 3 out of 4 GPUs I have.

The key insight is: "A shader cannot reliably read from a UAV of a resource filled by another shader; you need to use an SRV to read from it."

Another is: "The two ways to sum float values produced by different thread groups are a) lock the buffer and sum on the CPU, or b) spin on InterlockedCompareExchange."

The key insight is: "A shader cannot reliably read from a UAV of a resource filled by another shader; you need to use an SRV to read from it."

Shouldn't that be possible as long as you issue the appropriate resource transition between the two shader dispatches? The transition tells the driver that there's a data dependency between the two dispatches, so it can insert a wait-for-cache-flush command before the second one, ensuring UAV coherency.


It doesn't work on the Microsoft Basic Render Driver and possibly on the HD 4600 (I have other problems with that one); I just checked. It also requires one more resource barrier, since D3D12 doesn't allow transitioning from a state to that same state.

It sounds like you should have been using D3D12_RESOURCE_UAV_BARRIER.

"Represents a resource in which all UAV accesses must complete before any future UAV accesses can begin."

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

