Unexpected output result in compute shader with InterlockedAdd()

Graphics and GPU Programming Programming DX12 HLSL

Started by SquallLiu September 16, 2020 02:06 PM

3 comments, last by SquallLiu 3 years, 7 months ago

SquallLiu

Author

September 16, 2020 02:06 PM

Recently I implemented the forward+ tile culling for point light.

It seems there is a problem when storing variable with the index return by InterlockedAdd().

Here is the code section:

for (uint lightIndex = _threadIdx; lightIndex < _NumPointLight; lightIndex += TILE_SIZE * TILE_SIZE)
{
   ... calc tile plane, not important for this question so I skip them here ...

   bool overlapping = SphereInsideFrustum(float4(lightPos, light.range), plane);
   if (overlapping)
   {
       uint idx = 0;
       InterlockedAdd(tilePointLightCount, 1, idx);
       tilePointLightArray[idx] = lightIndex;
   }
}
GroupMemoryBarrierWithGroupSync();

if (_threadIdx == 0)
{
    // store opaque
    _TileResult.Store(tileOffset, tilePointLightCount);
    uint offset = tileOffset + 4;
    for (uint i = 0; i < tilePointLightCount; i++)
    {
        _TileResult.Store(offset, tilePointLightArray[i]);
        offset += 4;
    }
}

tilePointLightCount & tilePointLightArray are group shared variable.

And _threadIdx comes from SV_GroupIndex, each thread will cull 1 light.

Most samples use index return from InterlockedAdd when storing group shared variable.

However I got unexpected glitch tiles:

I've checked my algorithm and I'm sure the calculation is correct (including frustum, sphere-plane intersection).

Even I change overlapping to true and force shader to store every light index in array, the glitch still exists.

So I tried another code:

if (overlapping)
{
    uint idx = 0;
    InterlockedAdd(tilePointLightCount, 1, idx);
    tilePointLightArray[lightIndex] = true;
}

if (_threadIdx == 0)
{
    // store opaque
    _TileResult.Store(tileOffset, tilePointLightCount);
    uint offset = tileOffset + 4;
    for (uint i = 0; i < _NumPointLight; i++)
    {
        if (tilePointLightArray[i])
        {
            _TileResult.Store(offset, i);
            offset += 4;
        }
    }
}

This time, I mark the light that is overlapped with tile and store the light index later with for loop.

Now the glitch is gone!

Both code section give me a boost on forward lighting calculation from 11ms to 0.48ms with 417 point lights.

However I need to use 2nd code section to get stable output.

Is there any problem in my 1st code?It should works like a charm.

Or is it a driver issue?My card is RTX 2070 with driver version 452.06.

I've heard other cases about InterlockedAdd() with nvidia card. For example:

https://forums.developer.nvidia.com/t/interlockedadd-limitations-since-v-334-67beta-v-334-89/32522

https://www.gamedev.net/forums/topic/694086-compute-shaders-synchronization-issue/

https://stackoverflow.com/questions/54794206/calling-interlockedadd-on-rwbyteaddressbuffer-multiple-times-gives-unexpected-re

I've tried to figure out this problem few days. Any suggestions would be great.

I really want to use 1st code, culling time is slower with 2nd code. (0.21ms -> 1.0ms if I use 2nd code.)

Thanks!

MJP

20,297

September 20, 2020 12:19 AM

Driver/compiler bugs are always possible, but in general they're going to be unlikely. Usually you hit them in code paths that aren't commonly run, and doing an atomic add on a variable in shared memory is fairly common. For instance the one gamedev forum link you have there is a much more complex scenario involving indirect argument buffers and inter-GPU synchronization/barriers.

Since I can only see a portion of your code, I'll have to ask a few questions:

Are you making sure to initialize your shared light count to 0 for every thread group?
Is it possible that you're overflowing the maximum number of light indices that you can store in the shared array?

Some other notes that aren't related to your current problem:

If you're computing the sub-frustum planes inside the light loop that's not strictly necessary, you should be able to pull that out of the loop
You'll likely get performance if you distribute your final writes of the light indices to a buffer among the threads in your thread group, rather than only having the first thread loop over all of them

The Blog | The Book

SquallLiu

Author

September 21, 2020 05:44 AM

// result
RWByteAddressBuffer _TileResult : register(u0);
RWByteAddressBuffer _TransTileResult : register(u1);

groupshared uint minDepthU;
groupshared uint maxDepthU;
groupshared uint tilePointLightCount;
groupshared uint transTilePointLightCount;
groupshared uint tilePointLightArray[1024];
groupshared uint transTilePointLightArray[1024];

[RootSignature(ForwardPlusTileRS)]
[numthreads(TILE_SIZE, TILE_SIZE, 1)]
void ForwardPlusTileCS(uint3 _globalID : SV_DispatchThreadID, uint3 _groupID : SV_GroupID, uint _threadIdx : SV_GroupIndex)
{
	float depth = -1;

	if (_globalID.x >= _ScreenSize.x || _globalID.y >= _ScreenSize.y)
	{
		// out of range
		// do nothing here, we need to check depth within a tile
	}
	else
	{
		// fetch depth
		depth = _TexTable[_DepthIndex][_globalID.xy].r;
	}

	// init thread variable and sync
	if (_threadIdx == 0)
	{
		minDepthU = UINT_MAX;
		maxDepthU = 0;
		tilePointLightCount = 0;
		transTilePointLightCount = 0;
	}
	GroupMemoryBarrierWithGroupSync();

	// find min/max depth
	if (depth != -1.0)
	{
		uint depthU = asuint(depth);

		InterlockedMin(minDepthU, depthU);
		InterlockedMax(maxDepthU, depthU);
	}
	GroupMemoryBarrierWithGroupSync();

	// tile data
	uint tileIndex = _groupID.x + _groupID.y * _TileCountX;
	uint tileOffset = GetPointLightOffset(tileIndex);

	// calc frustum plane
	float minDepthF = asfloat(minDepthU);
	float maxDepthF = asfloat(maxDepthU);
	maxDepthF = lerp(maxDepthF, minDepthF + FLOAT_EPSILON, (maxDepthF - minDepthF) < FLOAT_EPSILON);

	float4 plane[6];
	CalcFrustumPlanes(_groupID.x, _groupID.y, maxDepthF, minDepthF, plane);

	for (uint lightIndex = _threadIdx; lightIndex < _NumPointLight; lightIndex += TILE_SIZE * TILE_SIZE)
	{
		// light overlap test
		SqLight light = _SqPointLight[lightIndex];
		float3 lightPosV = mul(SQ_MATRIX_V, float4(light.world.xyz, 1.0f)).xyz * float3(1, 1, -1);

		// test opaque
		bool overlapping = SphereInsideFrustum(float4(lightPosV, light.range), plane);
		if (overlapping)
		{
			uint idx = 0;
			InterlockedAdd(tilePointLightCount, 1, idx);
			tilePointLightArray[idx] = lightIndex;
		}

		// test transparent
		plane[4].w = 0.0f;
		overlapping = SphereInsideFrustum(float4(lightPosV, light.range), plane);
		if (overlapping)
		{
			uint idx = 0;
			InterlockedAdd(transTilePointLightCount, 1, idx);
			transTilePointLightArray[idx] = lightIndex;
		}
	}
	GroupMemoryBarrierWithGroupSync();

	// output by one thread
	if (_threadIdx == _NumPointLight)
	{
		// store opaque
		_TileResult.Store(tileOffset, tilePointLightCount);
		uint offset = tileOffset + 4;
		for (uint i = 0; i < tilePointLightCount; i++)
		{
			//if (tilePointLightArray[i])
			{
				_TileResult.Store(offset, tilePointLightArray[i]);
				offset += 4;
			}
		}

		// store transparent
		_TransTileResult.Store(tileOffset, transTilePointLightCount);
		offset = tileOffset + 4;
		for (i = 0; i < transTilePointLightCount; i++)
		{
			//if (transTilePointLightArray[i])
			{
				_TransTileResult.Store(offset, transTilePointLightArray[i]);
				offset += 4;
			}
		}
	}
}

Here is full code (1st method) except some custom function(Since they'are not the problem).

Yes the light count is initialized to 0, since it is a group shared variable I only init it on one thread.
I use _NumPointLight in for loop condition and I'm sure this number is correct, overflowing should not happen.

I wonder if I still miss some steps so that the result is unexpected?

Pull out frustum planes from loop and distribute final writes are both useful suggestions.

Thanks MJP!

SquallLiu

Author

September 22, 2020 05:40 PM

After a further testing of my code.

I realize the bug is somewhere else lol.

The tile culling is fine.

The problem is how I use the result later, my algorithm doesn't work well if the light index is not consistent through tiles.

For example, Tile 0 & Tile 1 both are covered by Light0~Light5. Where both Tiles stored:

Tile 0: 0,1,2,3,4

Tile 1: 0,2,3,4,1

^^^ This causes the glitch. I will modify my methods.

InterlockedAdd() works well.

My bad!

Unexpected output result in compute shader with InterlockedAdd()

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Unexpected output result in compute shader with InterlockedAdd()

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines