Issues with sync between compute shader and graphics pipeline

Graphics and GPU Programming Programming

Started by rick_appleton January 27, 2022 06:20 PM

4 comments, last by JoeJ 3 years, 1 month ago

rick_appleton

Author

864

January 27, 2022 06:20 PM

Hello all,

Usually I can solve issues like this myself, but I seem unable to find the bug in my code this time. I think I must be misunderstanding something about the Vulkan sync model for this to happen, but am unsure what it is.

The setup

I've created a Vulkan application to play around with cluster culling. It creates a large grid mesh, with about 30M triangles, which are then clustered together in groups of 128.

It then runs a compute shader that inspects each cluster and decides to cull the cluster or not. Currently it simply accepts all clusters as visible and outputs them into a new sequential buffer. The compute shader also fills a DrawIndirect struct with the proper cluster size (vertex count) and visible cluster count (instance count).

Then with an DrawIndirect command I render the visible clusters.

The issue

While this works just fine for ‘small’ batches, I was using the Lucy model as a large scale test case and noticed issues there. For some reason, the vertex count of the DrawIndirect struct isn't being filled, and the value for the instance count is too low.

The details

Compute shader code for the cluster culling (which doesn't currently cull anything):

#version 430
#extension GL_ARB_compute_shader : enable

#define CLUSTER_SIZE 128

struct dd_mesh_cluster_data_t {
  uint base_vertex;
  uint base_index;
  int padding_A;
  int padding_B;
};

struct dd_draw_indirect_command_t {
  uint vertex_count;
  uint instance_count;
  uint vertex_offset;
  uint instance_offset;
};

layout(local_size_x = 64) in;
layout(local_size_y = 1) in;
layout(local_size_z = 1) in;

layout(push_constant) uniform Parameters { 
  uint cluster_count;
} params;

layout(std430, binding = 0) readonly buffer all_clusters_t {
  dd_mesh_cluster_data_t all_clusters[];
};
layout(std430, binding = 1) writeonly buffer visible_clusters_t {
  dd_mesh_cluster_data_t visible_clusters[];
};
layout(std430, binding = 2) volatile buffer visible_cluster_count_t {
  dd_draw_indirect_command_t cluster_draw_indirect[];
};

shared uint local_count;
shared uint offset;
void main() {
  // Reset the count.
  if (gl_LocalInvocationID.x == 0) {
    cluster_draw_indirect[0].vertex_count = 0;
    cluster_draw_indirect[1].vertex_count = 0;
    local_count                           = 0;
  }

  memoryBarrierBuffer();
  memoryBarrierShared();
  memoryBarrier();
  barrier();

  const uint id           = gl_GlobalInvocationID.x;
  const bool keep_cluster = ((id < params.cluster_count));
  
  uint output_cluster_index = 0;
  if (keep_cluster) {
    output_cluster_index = atomicAdd(local_count, 1);
  }

  memoryBarrierBuffer();
  memoryBarrierShared();
  memoryBarrier();
  barrier();

  if (gl_LocalInvocationID.x == 0) {
    offset = atomicAdd(cluster_draw_indirect[1].vertex_count, local_count);
  }

  memoryBarrierBuffer();
  memoryBarrierShared();
  memoryBarrier();
  barrier();

  if (gl_LocalInvocationID.x == 0) {
    visible_clusters[gl_WorkGroupID.x].padding_A = int(gl_WorkGroupID.x);
    visible_clusters[gl_WorkGroupID.x].padding_B = int(local_count);
  }

  if (keep_cluster) {
    output_cluster_index += offset;
    visible_clusters[output_cluster_index].base_vertex = all_clusters[id].base_vertex;
    visible_clusters[output_cluster_index].base_index  = all_clusters[id].base_index;
  }

  memoryBarrierBuffer();
  memoryBarrierShared();
  memoryBarrier();
  barrier();

  if (gl_GlobalInvocationID.x == 0) {
    uint visible_clusters = cluster_draw_indirect[1].vertex_count;
    // It appears the following lines isn't executed when cluster count is large.
    cluster_draw_indirect[0].vertex_count    = CLUSTER_SIZE * 3;
    cluster_draw_indirect[0].instance_count  = visible_clusters;
    cluster_draw_indirect[0].vertex_offset   = 0;
    cluster_draw_indirect[0].instance_offset = 0;
  }
}

This is a screenshot of a RenderDoc capture of the issue.

You can see compute dispatch with 4100 workgroups (= 262369 clusters / 64 work group size)
Not visible here, but there is a PushConstant setting the cluster count correctly to 262369.
The DrawIndirect struct is in element 2 of Buffer 242 which is visible on the right
Visible cluster count is also in the first data element of element 3 of Buffer 242 (this was a test)
Visible cluster count is output as 0x140, which is 320 clusters (incorrect)
DrawIndirect is called with instance count of 320 (incorrect, but matches data), and vertex count 0 (incorrect, but matches data)
Although the compute shader should be filling the DrawIndirect struct, I've also saved the value elater in the buffer, and I do a CopyBuffer to put it in the right place. This step shouldn't be needed but I've been trying all kinds of things to get this to work stable.
You can't see it in the screen shot, after the Dispatch, but before the CopyBuffer is a PipelineBarrier(, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,…). As I understand it this should ensure all the invocations of the shader are finished before continuing with the next command.

What to solve?

At the very least I expect the shader to write 128 * 3 = 378 into the vertex count element of the DrawIndirect struct, as there is always a shader invocation with gl_GlobalInvocationID.x==0. Why this isn't happening is beyond me, and seems to indicate a sync issue (maybe my barrier is wrong).

I'm willing to share the program and or captures (quite big) if they will help, but maybe someone can already point me in the right direction with this info.

[size="1"]Daedalus Development | E-Mail me

JoeJ

4,428

January 27, 2022 07:41 PM

I'm confused from this final block:


if (gl_GlobalInvocationID.x == 0) { 
uint visible_clusters = cluster_draw_indirect[1].vertex_count; 
// It appears the following lines isn't executed when cluster count is large.

It looks like every workgroup would execute this, and so you have a write hazard?
Instead of writing a number to this final draw count, i would expect doing just atomic additions to that.
But that sounds too obvious so i guess i miss something.

About performance, i would expect a speed up if you first fill local buffers of LDS, and write batches of them to VRAM once they have enough data or shader is done.
This would replace very most VRAM atomics with LDS atomics, and lead to better linear memory access. (Probably you planned to do this later anyway.)

rick_appleton

Author

864

January 29, 2022 06:41 AM

Thanks for your reply and the help.

Solution

I took your suggestion to bring it back to a single atomic value in the buffer. I started out that way but was partially through optimizing it. While I was doing that I noticed a bug in my original code: the first part of the compute shader is incorrect. gl_GlobalInstanceID.x==0 should clear the buffers, and gl_LocalInstanceID.x==0 should clear the local_count shared variable. From the observed results I'm guessing that one or more of the later work groups reset the global count after they had already been partially updated. I thought the barriers would prevent that, but apparently not. The working code is included at the bottom of my reply.

Questions

I'm not sure I follow your other questions/remarks. Why would the if(gl_GlobalInstanceID.x==0) block be executed for all workgroups? With regards to the LDS remark, as I mentioned I was partway through optimizing the code. I was following along with https://gpuopen.com/learn/fast-compaction-with-mbcnt/ but without the ballot yet. The atomic value would be in shared memory, thus reducing contention on VRAM. Once each invocation has it's output id it writes directly to VRAM to the correct slot, so no contention there.

I couldn't completely map what they say to Vulkan through. I mapping workgroupThreadId to gl_LocalInstanceID, but I have no clue what wavefrontThreadId would map to in Vulkan. Possibly because of that my optimized version wasn't working correctly.

#version 430
#extension GL_ARB_compute_shader : enable

#define CLUSTER_SIZE 128

struct dd_mesh_cluster_data_t {
  uint base_vertex;
  uint base_index;
  int padding_A;
  int padding_B;
};

struct dd_draw_indirect_command_t {
  uint vertex_count;
  uint instance_count;
  uint vertex_offset;
  uint instance_offset;
};

layout(local_size_x = 64) in;
layout(local_size_y = 1) in;
layout(local_size_z = 1) in;

layout(push_constant) uniform
    Parameters {  // specify push constants. on cpp side its layout is fixed at PipelineLayout, and
                  // values are provided via vk::CommandBuffer::pushConstants()
  uint cluster_count;
}
params;

layout(std430, binding = 0) readonly buffer all_clusters_t {
  dd_mesh_cluster_data_t all_clusters[];
};
layout(std430, binding = 1) writeonly buffer visible_clusters_t {
  dd_mesh_cluster_data_t visible_clusters[];
};
layout(std430, binding = 2) volatile buffer visible_cluster_count_t {
  dd_draw_indirect_command_t cluster_draw_indirect;
};

void main() {
  // Reset the count.
  if (gl_GlobalInvocationID.x == 0) {
    cluster_draw_indirect.vertex_count    = CLUSTER_SIZE * 3;
    cluster_draw_indirect.instance_count  = 0;
    cluster_draw_indirect.vertex_offset   = 0;
    cluster_draw_indirect.instance_offset = 0;
  }

  memoryBarrierBuffer();

  const uint id           = gl_GlobalInvocationID.x;
  const bool keep_cluster = ((id < params.cluster_count));
  // const bool keep_cluster = ((id < params.cluster_count) && (id % 2 == 0));
  // const bool keep_cluster = ((id < params.cluster_count) && (all_clusters[id].padding_A > 0));

  uint output_cluster_index = 0;
  if (keep_cluster) {
    output_cluster_index = atomicAdd(cluster_draw_indirect.instance_count, 1);
  }

  if (keep_cluster) {
    visible_clusters[output_cluster_index].base_vertex = all_clusters[id].base_vertex;
    visible_clusters[output_cluster_index].base_index  = all_clusters[id].base_index;
  }
}

[size="1"]Daedalus Development | E-Mail me

JoeJ

4,428

January 29, 2022 09:54 AM

rick_appleton said:
I'm not sure I follow your other questions/remarks. Why would the if(gl_GlobalInstanceID.x==0) block be executed for all workgroups?

Oh - my fault. I did not spot the ‘confusing final block’ i had mentioned used the global id. I thought you used the local id everywhere. : )

The compaction AMD example is what i meant with the optimization proposal, yes. When i worked on VK compute shaders, subgroup functions such as ballot and popcnt were not available yet, and AMDs compiler to support their early extensions was too buggy to use. So i still have to do those optimizations myself in the future and lack experience.

rick_appleton said:
I couldn't completely map what they say to Vulkan through. I mapping workgroupThreadId to gl_LocalInstanceID, but I have no clue what wavefrontThreadId would map to in Vulkan. Possibly because of that my optimized version wasn't working correctly.

Can't answer due to the above. But looking up this: https://www.khronos.org/blog/vulkan-subgroup-tutorial
It looks like it's ‘gl_SubgroupInvocationID’.

JoeJ

4,428

January 29, 2022 10:09 AM

Looking up in more detail i still have some concerns.
Your idea is to use the global id to ensure a task (initializing the counter to zero) is only executed once per dispatch. That's a nice idea, and it's new to me.
But i don't think it is guaranteed that global id 0 is executed before other workgroups. And thus your current solution may not work on any HW or future drivers.
It may happen other workgroups write some data to the cluster count before it is initialized to zero.

What i would usually do here is to move the initialization to a former dispatch, so i can guarantee execution order with API barriers between dispatches.
This ofc. means you might have a whole dispatch with just one thread working to initialize the count, which sucks, but it may be the only ‘correct’ option.

Still, people violate such things by intend, test it on multiple HW and keep it if it works.
An example is UE5 Nanite, where they use persistent threads to implement a multi producer / multi consumer queue.
It's not guaranteed the persistent producer threads keep executing along the many consumer threads, so the system could deadlock in theory. But they tested it, it worked everywhere, so they kept it. (Not sure if i understood the details precisely)

Issues with sync between compute shader and graphics pipeline

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Issues with sync between compute shader and graphics pipeline

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines