Wave intrinsics, interlocked operations, pixel shaders and OIT.

Started by Gabriel Lassonde September 13, 2023 05:04 PM

6 comments, last by Gabriel Lassonde 8 months ago

Author

September 13, 2023 05:04 PM

Hi,

Not sure what I am doing wrong, but I am trying to reduce the number of interlocked operations I perform in a pixel shader by using wave operations. This is the classic per-pixel linked-list order-independent translucency algorithm combined with the classic wave intrinsic example.

Here's the code:

void addOITSample(uint2 coord,
				  RWStructuredBuffer< uint > oit_rw_grid,
				  RWStructuredBuffer< uint > oit_rw_samples,
				  OITParameters oit_parameters,
				  float3 color,
				  float transmittance,
				  float depth)
{
	// Allocate sample
#if false
	uint active_offset = WavePrefixCountBits(true);
	uint active_count  = WaveActiveCountBits(true);
	uint sample_index;
	if(WaveIsFirstLane())
		InterlockedAdd(oit_rw_samples[0], active_count, sample_index);
	sample_index = WaveReadLaneFirst(sample_index);
	sample_index += active_offset;
#else
	uint sample_index;
	InterlockedAdd(oit_rw_samples[0], 1, sample_index);
#endif
	// If allocation succeeded
	if(sample_index < oit_parameters.sample_count)
	{
		// Add sample to list
		uint list_index = coord.x + coord.y * oit_parameters.width;

		uint next_pointer;
		InterlockedExchange(oit_rw_grid[list_index], sample_index, next_pointer);

		// Output sample
		oit_rw_samples[1 + sample_index * 3 + 0] = packUFloat(depth, 24, 8) | packUFloat(transmittance, 8, 0);
		oit_rw_samples[1 + sample_index * 3 + 1] = float3_to_r11g11b10(color);
		oit_rw_samples[1 + sample_index * 3 + 2] = next_pointer;
	}
}

If I enable the wave intrinsic version of the sample allocation, I get incorrect visual results and eventually a GPU crash. Any ideas?
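
For reference, the arithmetic the wave path is meant to perform can be sketched on the CPU. This is a minimal C++ model, not shader code: allocateForWave, WaveAllocation, kInactive and the fixed 32-lane width are names and assumptions made up for this illustration, and active_mask stands in for whichever lanes reach the call.

```cpp
#include <cstdint>
#include <vector>

using u32 = std::uint32_t;

// Marker for lanes that did not take part in the allocation (hypothetical).
constexpr u32 kInactive = 0xFFFFFFFFu;

// Result of one wave-wide allocation in this CPU model (hypothetical type).
struct WaveAllocation {
    u32 counter_after;         // global counter value after the single add
    std::vector<u32> indices;  // per-lane sample index, kInactive if lane is off
};

// Counts the set bits in v.
static u32 countBits(u32 v)
{
    u32 n = 0;
    for (; v != 0; v &= v - 1) ++n;  // clears the lowest set bit each step
    return n;
}

// Models one 32-lane wave in which the lanes set in active_mask allocate
// one sample each from a shared counter.
WaveAllocation allocateForWave(u32 active_mask, u32 counter)
{
    // Stands in for WaveActiveCountBits(true): number of active lanes.
    const u32 active_count = countBits(active_mask);

    // Stands in for the single InterlockedAdd issued by the first active lane,
    // which returns the old counter value. WaveReadLaneFirst then hands that
    // value to every active lane.
    const u32 base = counter;
    counter += active_count;

    WaveAllocation result{counter, std::vector<u32>(32, kInactive)};
    for (u32 lane = 0; lane < 32; ++lane) {
        if ((active_mask >> lane) & 1u) {
            // Stands in for WavePrefixCountBits(true): active lanes below this one.
            const u32 lower_lanes = active_mask & ((1u << lane) - 1u);
            result.indices[lane] = base + countBits(lower_lanes);
        }
    }
    return result;
}
```

With lanes 0, 1 and 3 active and the counter at 10, the model hands out indices 10, 11 and 12 and leaves the counter at 13, which is the same result as three separate InterlockedAdd calls of 1. It only checks the arithmetic. It does not model how a real GPU forms or schedules pixel shader waves, so it cannot by itself explain the artifacts.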

Aressera

September 13, 2023 10:21 PM

I know nothing about what you are doing, but I recently saw some artifacts similar to that when trying to read a depth buffer texture that was currently attached as a render target. I guess there is some kind of data race going on.

Gabriel Lassonde

Author

September 14, 2023 02:30 AM

For context, OIT is order-independent transparency/translucency. Here's a nice article about the technique:

https://interplayoflight.wordpress.com/2022/06/25/order-independent-transparency-part-1/

And the wave intrinsics documentation:

https://github.com/Microsoft/DirectXShaderCompiler/wiki/Wave-Intrinsics

JoeJ

September 14, 2023 10:32 AM

	if(WaveIsFirstLane())
		InterlockedAdd(oit_rw_samples[0], active_count, sample_index);
		
// Maybe you need a barrier here, so the first lane is guaranteed to have fetched the index before the other lanes read it.
		

	sample_index = WaveReadLaneFirst(sample_index);

Gabriel Lassonde

Author

September 14, 2023 10:52 PM

@JoeJ Lanes in a wave are all synchronous. That's the appeal of using wave operations. It's like Intel's SIMD on the CPU in that regard. So, unlike threads in a thread group, you do not need a barrier to synchronize threads (lanes) in a wave.

JoeJ

September 15, 2023 06:30 AM

Gabriel Lassonde said:
Lanes in a wave are all synchronous. That's the appeal of using wave operations. It's like Intel's SIMD on the CPU in that regard. So, unlike threads in a thread group, you do not need a barrier to synchronize threads (lanes) in a wave.

Yes, but in cases where this applies, compilers will remove redundant barriers.

However, I'm not sure it always applies. For example, you might run on an AMD GPU that decides to use 64-thread WGP mode although the SIMDs are only 32 threads wide, or on an Intel GPU that might process only 8 threads in lockstep within a larger workgroup, AFAIK.
I really don't know how this maps to pixel shaders, though. Maybe there is indeed never a need for execution barriers. But then I'm still concerned about memory barriers, which might be needed regardless.

It seems your artifacts happen only on triangle edges, which points to a problem only where a thread group processes multiple triangles, so some synchronization is likely missing.
I'd give it a try (assuming barriers are possible in pixel shaders at all, of course).

Btw, how does your performance improve from using wave intrinsics?

Gabriel Lassonde

Author

September 16, 2023 01:19 AM

@JoeJ Memory barriers are a compute shader thing that does not apply to pixel shaders (they have no groups). https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/groupmemorybarrier
