
Simple compute shader - different results in release and debug

Started by savail, July 05, 2018 01:08 PM
4 comments, last by JoeJ 6 years, 7 months ago

Hey,

This is a very strange problem... I've got a compute shader that's supposed to fill a 3D texture (the voxels of a metavoxel) with color, based on the particles that cover the given metavoxel. This is the code:


static const int VOXEL_WIDTH_IN_METAVOXEL = 32;
static const int VOXEL_SIZE = 1; 
static const float VOXEL_HALF_DIAGONAL_LENGTH_SQUARED = (VOXEL_SIZE * VOXEL_SIZE + 2.0f * VOXEL_SIZE * VOXEL_SIZE) / 4.0f;
static const int MAX_PARTICLES_IN_METAVOXEL = 32;

struct Particle
{
	float3 position;
	float radius;
};

cbuffer OccupiedMetavData : register(b6)
{
	float3 occupiedMetavWorldPos;
	int numberOfParticles;
	Particle particlesBin[MAX_PARTICLES_IN_METAVOXEL];
};

RWTexture3D<float4> metavoxelTexUav : register(u5);

[numthreads(VOXEL_WIDTH_IN_METAVOXEL, VOXEL_WIDTH_IN_METAVOXEL, 1)]
void main(uint2 groupThreadId : SV_GroupThreadID)
{
	float4 voxelColumnData[VOXEL_WIDTH_IN_METAVOXEL];	// one full column of voxels along the metavoxel's depth, kept per thread

	float particleRadiusSquared;
	float3 distVec;

	// initialize the whole column with the ambient blue-ish color
	for (int i = 0; i < VOXEL_WIDTH_IN_METAVOXEL; i++)
		voxelColumnData[i] = float4(0.0f, 0.0f, 1.0f, 0.0f);

	for (int k = 0; k < numberOfParticles; k++)
	{
		particleRadiusSquared = particlesBin[k].radius * particlesBin[k].radius + VOXEL_HALF_DIAGONAL_LENGTH_SQUARED;

		distVec.xy = (occupiedMetavWorldPos.xy + groupThreadId * VOXEL_SIZE) - particlesBin[k].position.xy;

		for (int i = 0; i < VOXEL_WIDTH_IN_METAVOXEL; i++)
		{
			distVec.z = (occupiedMetavWorldPos.z + i * VOXEL_SIZE) - particlesBin[k].position.z;

			if (dot(distVec, distVec) < particleRadiusSquared)
			{
				//given voxel is covered by particle
				voxelColumnData[i] += float4(0.0f, 1.0f, 0.0f, 1.0f);
			}
		}
	}

	for (int i = 0; i < VOXEL_WIDTH_IN_METAVOXEL; i++)
		metavoxelTexUav[uint3(groupThreadId.x, groupThreadId.y, i)] = clamp(voxelColumnData[i], 0.0, 1.0);
}

And it works well in debug mode. This is the correct-looking result obtained after raymarching one metavoxel from the camera:

[Image: raymarched metavoxel, debug build]

As you can see, the particle only covers the top right corner of the metavoxel.

However, in release mode the result looks like this:

[Image: raymarched metavoxel, release build]

It looks like the upper half of the metavoxel was not filled at all, not even with the ambient blue-ish color from the first "for" loop... I nailed it down to one line of code in the above shader: when I replace "numberOfParticles" in the "for" loop with a constant value such as 1 (which is the value uploaded to the GPU anyway), the result finally looks the same as in debug mode.
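In other words, the only difference between the broken and the working version is the loop bound:

		// broken in release, fine in debug:
		for (int k = 0; k < numberOfParticles; k++)

		// works in both configurations:
		for (int k = 0; k < 1; k++)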

This is the shader compile method from the Hieroglyph Rendering Engine (awesome engine) and it looks fine to me, but maybe something's wrong? My only modification was adding include functionality:


ID3DBlob* ShaderFactoryDX11::GenerateShader( ShaderType type, std::wstring& filename, std::wstring& function,
            std::wstring& model, const D3D_SHADER_MACRO* pDefines, bool enablelogging )
{
    HRESULT hr = S_OK;

    std::wstringstream message;

    ID3DBlob* pCompiledShader = nullptr;
    ID3DBlob* pErrorMessages = nullptr;

    char AsciiFunction[1024];
    char AsciiModel[1024];
    WideCharToMultiByte(CP_ACP, 0, function.c_str(), -1, AsciiFunction, 1024, NULL, NULL);
    WideCharToMultiByte(CP_ACP, 0, model.c_str(), -1, AsciiModel, 1024, NULL, NULL);

    // TODO: The compilation of shaders has to skip the warnings as errors
    //       for the moment, since the new FXC.exe compiler in VS2012 is
    //       apparently more strict than before.

    UINT flags = D3DCOMPILE_PACK_MATRIX_ROW_MAJOR;
#ifdef _DEBUG
    flags |= D3DCOMPILE_DEBUG | D3DCOMPILE_SKIP_OPTIMIZATION; // | D3DCOMPILE_WARNINGS_ARE_ERRORS;
#endif

    // Get the current path to the shader folders, and add the filename to it.

    FileSystem fs;
    std::wstring filepath = fs.GetShaderFolder() + filename;

    // Load the file into memory

    FileLoader SourceFile;
    if ( !SourceFile.Open( filepath ) ) {
        message << "Unable to load shader from file: " << filepath;
        EventManager::Get()->ProcessEvent( EvtErrorMessagePtr( new EvtErrorMessage( message.str() ) ) );
        return( nullptr );
    }
    LPCSTR s;
    if ( FAILED( hr = D3DCompile(
        SourceFile.GetDataPtr(),
        SourceFile.GetDataSize(),
        GlyphString::wstringToString(filepath).c_str(), // NOTE: this must point to the concrete shader file; passing only the directory also compiles, but then the graphics debugger crashes when debugging shaders
        pDefines,
        D3D_COMPILE_STANDARD_FILE_INCLUDE,
        AsciiFunction,
        AsciiModel,
        flags,
        0,
        &pCompiledShader,
        &pErrorMessages ) ) )

    //if ( FAILED( hr = D3DX11CompileFromFile(
    //    filename.c_str(),
    //    pDefines,
    //    0,
    //    AsciiFunction,
    //    AsciiModel,
    //    flags,
    //    0,//UINT Flags2,
    //    0,
    //    &pCompiledShader,
    //    &pErrorMessages,
    //    &hr
    //    ) ) )
    {
        message << L"Error compiling shader program: " << filepath << std::endl << std::endl;
        message << L"The following error was reported:" << std::endl;

        if ( ( enablelogging ) && ( pErrorMessages != nullptr ) )
        {
            LPVOID pCompileErrors = pErrorMessages->GetBufferPointer();
            const char* pMessage = (const char*)pCompileErrors;
            message << GlyphString::ToUnicode( std::string( pMessage ) );
            Log::Get().Write( message.str() );
        }

        EventManager::Get()->ProcessEvent( EvtErrorMessagePtr( new EvtErrorMessage( message.str() ) ) );

        SAFE_RELEASE( pCompiledShader );
        SAFE_RELEASE( pErrorMessages );

        return( nullptr );
    }

    SAFE_RELEASE( pErrorMessages );

    return( pCompiledShader );
}

Could the shader crash for some reason midway through execution? The question also is: what could the compiler possibly do to the shader code in release mode that suddenly makes "numberOfParticles" invalid, and how do I fix this issue? Or maybe it's something deeper that results in numberOfParticles being invalid? I checked my constant buffer values with the graphics debugger in debug and release modes, and both had the correct value of 1 for numberOfParticles...

Alright... this is the first time compiler warnings became really important in my life xD. Especially the warnings generated by the HLSL compiler with the D3DCOMPILE_WARNINGS_ARE_ERRORS flag. This is the warning I got with the above compute shader:

[Image: HLSL compiler warning for the compute shader]

Though, I thought the driver would handle this case appropriately and set up a sequential queue of threads if there weren't enough registers for all threads to execute... It also appears that this limitation might just be per thread group, because when I replaced 1 group of 32x32 threads with 4 groups of 32x8 threads, everything finally works as expected in release mode. I'm really surprised the driver doesn't handle this automatically in release mode. Could it be that it does this in debug but not in release? Is there some way to force correct behaviour in release mode without manually dividing the threads? It's probably also driver specific, right? Any comments or insights would be really welcome! Thanks for your time anyway, guys.
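Roughly, the split looks like this (simplified; the Dispatch call issues 4 groups in Y per metavoxel and the loop bodies stay exactly the same, the name columnId is just for illustration):

[numthreads(VOXEL_WIDTH_IN_METAVOXEL, 8, 1)]	// was (32, 32, 1)
void main(uint3 dispatchThreadId : SV_DispatchThreadID)
{
	// with Dispatch(1, 4, 1) per metavoxel, dispatchThreadId.xy spans the same
	// 32x32 grid of voxel columns as before; the rest of the shader
	// (init, particle loop, depth loop, final write) is unchanged
	uint2 columnId = dispatchThreadId.xy;
	// ...
}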


I don't know what the difference between debug and release is here, but you should worry about the shader itself anyway. It is horribly inefficient:

1 hour ago, savail said:

float4 voxelColumnData[VOXEL_WIDTH_IN_METAVOXEL];

By doing so, you create an array, but not in any memory - you create it in registers.

Registers can not be addressed by index, so accessing a value will be compiled into something like this:


switch (index)
{
case 0: return R0;
case 1: return R1;
//...
case 31: return R31;
}

So, accessing a value alone is already totally inefficient, even if the compiler optimizes it with some binary search.

The larger the array is, the worse - and your array is very large: 32 * float4 = 128 scalar registers, for this array alone. On AMD GCN you aim for at most 24 (!) registers for max occupancy.

But worse, you require all those registers per thread. So one wavefront requires 128 registers * 64 threads = 8192 registers, which is a little more than the 24 per thread we aim for. Also, no GPU has enough registers for a whole 32x32 thread group like that. The compiler will create the array in some memory under the hood, which might explain the compiler messages you get.

So, the worst compute shader I've ever seen ;)

 

What you should do instead is utilize LDS (groupshared) memory in some way, which is usually the reason to use compute shaders.

Hey, thanks for your feedback! I agree with most of your points, but I wonder if this solution is really that bad in my specific case, at least on a GTX660M :P. I've run this app on a GTX1060 as well, and there this solution was indeed horrible, but on the GTX660M the situation is reversed - it proved to be the fastest one. I didn't know that registers are accessed like this (very valuable information, thanks!), but in my case you can see that the loop in which I access voxelColumnData executes a constant number of times => the compiler should be smart enough to unroll the loop and predict the registers from the array, right?

The current approach (4 groups of 32x8 threads, each thread processing 32 voxels in depth sequentially) takes about 6 ms to fill about 8 metavoxels (each of size 32x32x32), while the approach with shared memory (I tried a few configurations) yielded something like 7 ms. Another approach - running 32 groups of 32x32 threads per metavoxel, where each thread sets up the color for exactly one voxel based on the particles - is significantly faster on the GTX1060, but on the GTX660M it takes about 20 ms. Unfortunately, horrible solutions on one GPU might not be horrible on another :P, though I guess I should care more about the newer hardware than my GTX660M ; ]

10 minutes ago, savail said:

but in my case you can see the loop in which I access voxelColumnData executes constant number of times => compiler should be smart enough to unroll the loop and predict the registers from the array, right?

Yeah, a clever compiler could eliminate the need for the array entirely and decide to loop over each particle for each volume cell instead. But you can not rely on such compiler decisions - this leads to the wildly different performance you see on various hardware. (GTX6xx is usually terrible at compute, GTX10xx is good. But a good algorithm will be good on both.)

One solution would be to do this yourself (one thread per volume cell, looping over all particles). To utilize LDS, each thread would load one particle (or a small number of them) into LDS. Then all threads can read all the particles very quickly from LDS. You easily reduce particle bandwidth by a factor of e.g. 256 this way. (One thread only loads 4 particles, but has access to all particles of the workgroup.)
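A minimal sketch of that idea, reusing your cbuffer layout and constants (untested; with particles in a StructuredBuffer instead of a cbuffer the LDS caching pays off even more) - one 32x32 group per Z-slice, so 32 groups in Z per metavoxel:

groupshared Particle ldsParticles[MAX_PARTICLES_IN_METAVOXEL];

[numthreads(VOXEL_WIDTH_IN_METAVOXEL, VOXEL_WIDTH_IN_METAVOXEL, 1)]
void main(uint3 groupThreadId : SV_GroupThreadID, uint3 groupId : SV_GroupID)
{
	// cooperative load: the first numberOfParticles threads each fetch one particle
	uint flatId = groupThreadId.y * VOXEL_WIDTH_IN_METAVOXEL + groupThreadId.x;
	if (flatId < (uint)numberOfParticles)
		ldsParticles[flatId] = particlesBin[flatId];
	GroupMemoryBarrierWithGroupSync();

	// one thread per voxel: groupThreadId.xy selects the column, groupId.z the Z-slice
	float3 voxelWorldPos = occupiedMetavWorldPos + float3(groupThreadId.xy, groupId.z) * VOXEL_SIZE;

	float4 color = float4(0.0f, 0.0f, 1.0f, 0.0f);
	for (int k = 0; k < numberOfParticles; k++)
	{
		float3 distVec = voxelWorldPos - ldsParticles[k].position;
		float radiusSquared = ldsParticles[k].radius * ldsParticles[k].radius + VOXEL_HALF_DIAGONAL_LENGTH_SQUARED;
		if (dot(distVec, distVec) < radiusSquared)
			color += float4(0.0f, 1.0f, 0.0f, 1.0f);
	}

	metavoxelTexUav[uint3(groupThreadId.xy, groupId.z)] = clamp(color, 0.0, 1.0);
}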

Or you can use LDS to store a brick of the volume. (Each thread loads each particle, but because they all load the same particle at the same time, bandwidth should be reduced anyway.) Here you save bandwidth and expensive atomics to global memory.

You can also combine both approaches, or do something smarter like binning particles to larger volume bricks, etc. What is likely to be fastest depends on particle counts/sizes and the volume size. (Often you need to implement multiple techniques to find the best one :(, if you really care about performance.)

This topic is closed to new replies.
