Wave intrinsics, interlocked operations, pixel shaders and OIT.

6 comments, last by Gabriel Lassonde 8 months ago

Hi,

Not sure what I am doing wrong, but I was trying to use wave operations to reduce the number of interlocked operations I perform in a pixel shader. This is the classic per-pixel linked-list order-independent translucency algorithm combined with the classic wave intrinsics example.

Here's the code:

void addOITSample(uint2 coord,
				  RWStructuredBuffer< uint > oit_rw_grid,
				  RWStructuredBuffer< uint > oit_rw_samples,
				  OITParameters oit_parameters,
				  float3 color,
				  float transmittance,
				  float depth)
{
	// Allocate sample
#if false
	uint active_offset = WavePrefixCountBits(true);
	uint active_count  = WaveActiveCountBits(true);
	uint sample_index;
	if(WaveIsFirstLane())
		InterlockedAdd(oit_rw_samples[0], active_count, sample_index);
	sample_index = WaveReadLaneFirst(sample_index);
	sample_index += active_offset;
#else
	uint sample_index;
	InterlockedAdd(oit_rw_samples[0], 1, sample_index);
#endif
	// If allocation succeeded
	if(sample_index < oit_parameters.sample_count)
	{
		// Add sample to list
		uint list_index = coord.x + coord.y * oit_parameters.width;

		uint next_pointer;
		InterlockedExchange(oit_rw_grid[list_index], sample_index, next_pointer);

		// Output sample
		oit_rw_samples[1 + sample_index * 3 + 0] = packUFloat(depth, 24, 8) | packUFloat(transmittance, 8, 0);
		oit_rw_samples[1 + sample_index * 3 + 1] = float3_to_r11g11b10(color);
		oit_rw_samples[1 + sample_index * 3 + 2] = next_pointer;
	}
}

You see, if I enable the wave intrinsics version of the sample allocation, I get wrong visual results and eventually a GPU crash. Any ideas?

Allocation without wave intrinsics
Allocation with wave intrinsics
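
For reference, the arithmetic the wave path is meant to perform can be modeled on the CPU. The following is a minimal C++ sketch, not shader code: `waveAllocate` and the `active` mask are hypothetical names introduced here, and the loops only stand in for what `WaveActiveCountBits`, `WavePrefixCountBits` and `WaveReadLaneFirst` are expected to return for one wave.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU model of one wave (hypothetical helper, not part of the original shader).
// active[i] says whether lane i executes the allocation.
// Returns the sample index each lane would receive; inactive lanes get UINT32_MAX.
std::vector<uint32_t> waveAllocate(std::atomic<uint32_t>& counter,
                                   const std::vector<bool>& active)
{
    std::vector<uint32_t> result(active.size(), UINT32_MAX);

    // Stands in for WaveActiveCountBits(true): number of active lanes.
    uint32_t active_count = 0;
    for (bool a : active) active_count += a ? 1u : 0u;
    if (active_count == 0) return result;

    // One atomic for the whole wave, as the first active lane would issue it.
    const uint32_t base = counter.fetch_add(active_count);

    // Stands in for WavePrefixCountBits(true): active lanes with a lower index.
    uint32_t prefix = 0;
    for (std::size_t lane = 0; lane < active.size(); ++lane) {
        if (!active[lane]) continue;
        result[lane] = base + prefix;  // WaveReadLaneFirst(base) + active_offset
        ++prefix;
    }
    return result;
}
```

In this model, a counter starting at 10 with lanes 1, 2 and 4 active hands out indices 10, 11 and 12 and leaves the counter at 13. The scheme only works if every active lane sees the same base and a unique offset; if the lane that issues the atomic and the lanes that read its result disagree on who is active, indices can collide or go out of range.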

I know nothing about what you are doing, but I recently saw some artifacts similar to that when trying to read a depth buffer texture that was currently attached as a render target. I guess there is some kind of data race going on.

For context, OIT is order independent transparency/translucency. Here's a nice article about the technique:

https://interplayoflight.wordpress.com/2022/06/25/order-independent-transparency-part-1/

And the wave intrinsics documentation:

https://github.com/Microsoft/DirectXShaderCompiler/wiki/Wave-Intrinsics
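
As a rough illustration of the per-pixel linked list that the OIT article above describes, here is a small single-threaded C++ model. It is a sketch under assumptions: `OitLists`, `Sample` and the `kEndOfList` marker are hypothetical names, and the plain reads and writes stand in for the `InterlockedAdd` and `InterlockedExchange` calls a shader would use.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed "empty list" marker; the value the original shader clears its grid to is not shown.
constexpr uint32_t kEndOfList = 0xFFFFFFFFu;

struct Sample {
    float    depth;
    uint32_t next;  // index of the next sample for the same pixel
};

struct OitLists {
    std::vector<uint32_t> heads;    // one list head per pixel (the "grid" buffer)
    std::vector<Sample>   samples;  // shared sample pool

    explicit OitLists(std::size_t pixel_count) : heads(pixel_count, kEndOfList) {}

    // Push a sample at the front of a pixel's list.
    void push(std::size_t pixel, float depth) {
        // Allocate a slot (an atomic add on a counter in the shader).
        const uint32_t index = static_cast<uint32_t>(samples.size());
        // Swap in the new head and remember the old one (an atomic exchange in the shader).
        const uint32_t previous_head = heads[pixel];
        heads[pixel] = index;
        samples.push_back({depth, previous_head});
    }

    // Walk a pixel's list from newest to oldest sample.
    std::vector<float> depths(std::size_t pixel) const {
        std::vector<float> out;
        for (uint32_t i = heads[pixel]; i != kEndOfList; i = samples[i].next)
            out.push_back(samples[i].depth);
        return out;
    }
};
```

Each pixel's list comes back in reverse insertion order, which is why a resolve pass normally sorts the samples by depth before blending.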

	if(WaveIsFirstLane())
		InterlockedAdd(oit_rw_samples[0], active_count, sample_index);
		
// Maybe you need a barrier here, to ensure the first lane has finished getting the index before the other threads read it.
		

	sample_index = WaveReadLaneFirst(sample_index);

@JoeJ Lanes in a wave are all synchronous. That's the appeal of using wave operations. It's like Intel's SIMD on the CPU in that regard. So unlike threads in a thread group, you do not need a barrier to synchronize threads (lanes) in a wave.

Gabriel Lassonde said:
Lanes in a wave are all synchronous. That's the appeal of using wave operations. It's like Intel's SIMD on the CPU in that regard. So unlike threads in a thread group, you do not need a barrier to synchronize threads (lanes) in a wave.

Yes, but in cases where this applies, compilers will remove redundant barriers.

However, I'm not sure it always applies. E.g. you run on an AMD GPU which decides to use 64-thread WGP mode although the SIMDs are only 32 threads wide, or the same for Intel GPUs, which might process only 8 threads in lockstep within a larger workgroup, AFAIK.
I really don't know how this maps to pixel shaders, though. Maybe there is indeed never a need for execution barriers. But then I'm still concerned about memory barriers, which might be needed regardless.

It seems your artifacts happen only on triangle edges, indicating a problem only where a thread group processes multiple triangles, so likely some synchronization is missing.
I'd give it a try (assuming barriers are possible at all in pixel shaders, of course).

Btw, how does your performance improve from using wave intrinsics?

@JoeJ Memory barriers are a compute shader thing that does not apply to pixel shaders (they have no groups). https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/groupmemorybarrier

