Subsurface scattering in Vulkan path tracing


I tried making everything a vec4 but I still get garbage. I'm certain that it's something obvious. :(


taby said:
Should I just make everything a vec4?

Personally, I do this a lot.
But it should not really matter.

There is no help in VK if you need an array-of-structures layout. All the addressing needs to be done manually.
OpenGL had support for structs, IIRC, but it was really cumbersome to use, and I do not miss it.

There are functions to reinterpret the bits of a float as an int and vice versa (floatBitsToInt, intBitsToFloat, and their uint variants in GLSL), so storing some integers in float buffers is no problem.
I never had issues with alignment.
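The GLSL bit-reinterpretation built-ins copy raw bits rather than converting values, which is why an integer survives a trip through a float buffer unchanged. A minimal CPU-side sketch of the same round trip, assuming 32-bit IEEE-754 floats; `float_bits_to_int` and `int_bits_to_float` are hypothetical C++ helpers named after the GLSL functions, not an existing API:

```cpp
#include <cstdint>
#include <cstring>

// Assumption: float and int32_t are both 32 bits wide.
static_assert(sizeof(float) == sizeof(std::int32_t), "assumes 32-bit float");

// Hypothetical CPU-side counterpart of GLSL floatBitsToInt:
// copies the raw bits, no numeric conversion.
std::int32_t float_bits_to_int(float f)
{
    std::int32_t i;
    std::memcpy(&i, &f, sizeof i); // well-defined type punning in C++
    return i;
}

// Hypothetical CPU-side counterpart of GLSL intBitsToFloat.
float int_bits_to_float(std::int32_t i)
{
    float f;
    std::memcpy(&f, &i, sizeof f);
    return f;
}
```

For example, `float_bits_to_int(int_bits_to_float(12345))` gives back 12345. The float produced along the way is meaningless as a number (small integers become denormals), so such values should only be stored and copied, never used in float arithmetic.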

If you're sure the indexing is right, maybe the problem is related to memory allocation, but I can't remember having problems there either.

I also use an SSBO for debugging: I write numbers to it from the shaders, then download them to the CPU to display them. Maybe that's something you could try now to get better clues.

I found your related GitHub code in the rgen shader:

//	rays r[buffer_size];
	int current_buffer_index = 0;

This means every ray uses the SSBO starting from index 0, so the rays will overwrite and corrupt each other's data.

Because you use a complex struct for 'rays', alignment might be an issue as well, since we don't know what padding the driver adds or how large one struct actually is. Making everything a vec4 would help; you could pack 4 integers into one such vec4, etc.
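To make the "everything is a vec4" idea concrete on the host side, a record can be built only from 16-byte slots, with static_asserts so the host and shader strides cannot silently drift apart. This is a sketch under assumptions: `RayRecord` and its fields are hypothetical, not taken from the repository, and the matching std430 declaration would be `vec4 origin_tmin; vec4 dir_tmax; ivec4 ids;` in the same order.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical per-ray record made only of 16-byte slots.
// Matching GLSL (std430): vec4 origin_tmin; vec4 dir_tmax; ivec4 ids;
struct alignas(16) RayRecord
{
    float origin_tmin[4];  // xyz = origin, w = tmin
    float dir_tmax[4];     // xyz = direction, w = tmax
    std::int32_t ids[4];   // four integers packed into one ivec4 slot
                           // (hypothetical: pixel index, depth, material, flags)
};

// With only 16-byte members there is no hidden padding to guess at.
static_assert(sizeof(RayRecord) == 48, "three 16-byte slots");
static_assert(offsetof(RayRecord, dir_tmax) == 16, "second slot");
static_assert(offsetof(RayRecord, ids) == 32, "third slot");
```

If the buffer must be declared as vec4 only, the `ids` slot could instead hold the integers as float bit patterns written with intBitsToFloat.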

OK, I think I have the indexing designed. I'm going to make sure it's a good design before I start using it. Thanks for all of your help, JoeJ.


size_t screenshot_x = pixel_num_x;
size_t screenshot_y = pixel_num_y;
size_t screenshot_index = screenshot_y * screen_width + screenshot_x;

Yep, this should work then.
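If each ray needs more than one element (the commented-out `rays r[buffer_size]` suggests a fixed-size array per ray), the same row-major index can be scaled into a per-pixel slice so that no two pixels share storage. A minimal sketch, assuming every pixel gets exactly `slots_per_pixel` elements; `slice_base` and `slice_element` are hypothetical helper names:

```cpp
#include <cstddef>

// First element of the slice owned by pixel (x, y), row-major order.
// Hypothetical helper; slots_per_pixel is the fixed per-ray array size.
constexpr std::size_t slice_base(std::size_t x, std::size_t y,
                                 std::size_t screen_width,
                                 std::size_t slots_per_pixel)
{
    return (y * screen_width + x) * slots_per_pixel;
}

// Index of element k (0 <= k < slots_per_pixel) inside that pixel's slice.
constexpr std::size_t slice_element(std::size_t x, std::size_t y,
                                    std::size_t screen_width,
                                    std::size_t slots_per_pixel,
                                    std::size_t k)
{
    return slice_base(x, y, screen_width, slots_per_pixel) + k;
}
```

The buffer then needs `screen_width * screen_height * slots_per_pixel` elements. In a ray generation shader, x and y would come from `gl_LaunchIDEXT.xy` and the width from `gl_LaunchSizeEXT.x`.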

But just FYI, another common problem is a dynamic size per thread, e.g. if each of your rays required a differently sized array.

In this case, you would preallocate a buffer that is hopefully large enough for all of them, and then use atomics to sub-allocate a range from that buffer per thread:

int localSize = 7; // current thread needs 7 elements
int localRange = atomicAdd(globalCounter, localSize); // GLSL (atomic_add in OpenCL C); returns the counter value before the add

for (int i = 0; i < localSize; i++)
	globalBuffer[localRange + i] = localStuff[i];

// for simplicity I did not do any checks to ensure we do not exceed the size of the global buffer

Thanks to the atomic instruction, every thread is guaranteed its own range, so there are no write hazards, and it's still lock-free and fast.
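The same sub-allocation pattern can be tried on the CPU with `std::atomic`, where `fetch_add` plays the role of the shader's atomic add (both return the counter value before the addition). A sketch that also adds the capacity check the shader snippet skips; `SubAllocator` and its members are hypothetical names chosen for illustration:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical preallocated pool that many threads append variable-sized runs into.
struct SubAllocator
{
    std::vector<int> global_buffer;
    std::atomic<std::size_t> global_counter{0};

    explicit SubAllocator(std::size_t capacity) : global_buffer(capacity) {}

    // Reserves `count` elements with one atomic add and copies `local` into them.
    // Returns the start offset, or SIZE_MAX if the pool is exhausted.
    std::size_t append(const int* local, std::size_t count)
    {
        // fetch_add returns the value before the addition, so each caller
        // receives a disjoint range [start, start + count).
        std::size_t start = global_counter.fetch_add(count);
        if (start + count > global_buffer.size())
            return SIZE_MAX; // out of space; the counter stays past capacity
        for (std::size_t i = 0; i < count; ++i)
            global_buffer[start + i] = local[i];
        return start;
    }
};
```

`append` can be called concurrently from several `std::thread` workers: the ranges never overlap, so the plain element writes do not race. As on the GPU, the order in which threads land in the buffer is not deterministic.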

On the GPU there is no need to declare memory (like globalCounter) as atomic in advance: you can use atomic instructions on any integer in a buffer or in shared memory whenever you want.
That's really convenient. Using atomics on the CPU with C++ is pretty cumbersome in comparison.

Well, that was a nice foray into SSBOs. I now know a bit more than I did before. Thanks for the ideas, JoeJ. I tried to make a buffer that large, but it crashes the app.

taby said:
I tried to make a buffer that large, but it crashes the app.

How much memory would you need?

Anyway, I would try it at a low resolution like 320x200. That's good enough to see if there is some potential win.
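To answer "how much memory" up front, the byte count is just pixels × elements per pixel × element stride, and it is worth comparing against the device limits before allocating. A sketch assuming a 48-byte record and 8 elements per pixel, both purely illustrative. The Vulkan spec only guarantees a `maxStorageBufferRange` of at least 2^27 bytes (128 MiB) per storage-buffer binding, so the real value should be read from `VkPhysicalDeviceLimits` (and the allocation itself is also bounded by `maxMemoryAllocationSize`).

```cpp
#include <cstdint>

// Bytes needed for a per-pixel slice layout (all parameters are assumptions).
constexpr std::uint64_t buffer_bytes(std::uint64_t width, std::uint64_t height,
                                     std::uint64_t slots_per_pixel,
                                     std::uint64_t stride_bytes)
{
    return width * height * slots_per_pixel * stride_bytes;
}

// Spec-guaranteed minimum of VkPhysicalDeviceLimits::maxStorageBufferRange;
// actual devices usually report more, so query the real limit.
constexpr std::uint64_t kMinGuaranteedStorageRange = 1ull << 27; // 128 MiB

constexpr bool fits_guaranteed_range(std::uint64_t bytes)
{
    return bytes <= kMinGuaranteedStorageRange;
}
```

With these numbers, 320x200 needs 24,576,000 bytes (about 23 MiB), while 1920x1080 needs 796,262,400 bytes (about 759 MiB), well past the guaranteed range. That could explain a failure at full resolution, though the actual cause would need to be confirmed with the validation layers.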

JoeJ said:
There is no help in VK if you need an array-of-structures layout. All the addressing needs to be done manually. OpenGL had support for structs, IIRC, but it was really cumbersome to use, and I do not miss it.

This is one of the things I dislike about VK/GL and absolutely like about D3D12 (StructuredBuffer/RWStructuredBuffer).

JoeJ said:
Because you do use a complex struct for 'rays', alignment might be an issue as well

There are definitely specification requirements for alignment. I might be wrong here, but there should be a few alignment rules:

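Scalar alignment (used by the scalar block layout)

  • A scalar of size N has a scalar alignment of N
  • A vector has the scalar alignment of its component type
  • An array has the scalar alignment of its element type
  • A structure has the largest scalar alignment of any of its members
  • A matrix has the scalar alignment of its component type

Base alignment (used by std430)

  • A scalar has a base alignment equal to its scalar alignment
  • A 2-component vector has 2 times its scalar alignment
  • A 3- or 4-component vector has 4 times its scalar alignment
  • An array has the base alignment of its element type
  • A structure has the largest base alignment of any of its members (an empty structure's base alignment equals the size of the smallest scalar type the SPIR-V module's capabilities permit, e.g. a 1-byte-aligned empty struct requires StorageBuffer8BitAccess or UniformAndStorageBuffer8BitAccess to be declared in the SPIR-V module)
  • A matrix has the base alignment of the equivalent array of vectors

Extended alignment (used by std140)

  • A scalar or vector has an extended alignment equal to its base alignment
  • An array or structure has the largest extended alignment of its elements or members, rounded up to a multiple of 16
  • A matrix has the extended alignment of the equivalent array of vectors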
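To make the rules above concrete, here is a small offset calculator. It is a sketch that only handles float, vec2, vec3 and vec4 members of a single struct (no nested structs, matrices, or array members), following the base-alignment (std430) and extended-alignment (std140) rules; `Type`, `Layout`, `layout_struct` and `array_stride` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Simplified member types: 32-bit float scalars and vectors only.
enum class Type { Float, Vec2, Vec3, Vec4 };

constexpr std::size_t size_of(Type t)
{
    return t == Type::Float ? 4 : t == Type::Vec2 ? 8 : t == Type::Vec3 ? 12 : 16;
}

// Base alignment: N for a scalar, 2N for a 2-component vector,
// 4N for a 3- or 4-component vector (N = 4 bytes here).
constexpr std::size_t base_align(Type t)
{
    return t == Type::Float ? 4 : t == Type::Vec2 ? 8 : 16;
}

constexpr std::size_t align_up(std::size_t v, std::size_t a)
{
    return (v + a - 1) / a * a;
}

// Array stride: element size rounded up to the element's alignment;
// std140 (extended alignment) also rounds that alignment up to 16.
constexpr std::size_t array_stride(Type t, bool std140)
{
    return align_up(size_of(t), std140 ? align_up(base_align(t), 16) : base_align(t));
}

struct Layout
{
    std::vector<std::size_t> offsets; // byte offset of each member
    std::size_t size = 0;             // footprint including tail padding
};

// Scalar/vector members get the same offsets in std140 and std430
// (their extended alignment equals their base alignment); the difference
// is the struct's own alignment, which std140 rounds up to 16.
Layout layout_struct(const std::vector<Type>& members, bool std140)
{
    Layout out;
    std::size_t offset = 0;
    std::size_t struct_align = 1;
    for (Type t : members) {
        offset = align_up(offset, base_align(t));
        out.offsets.push_back(offset);
        offset += size_of(t);
        if (base_align(t) > struct_align)
            struct_align = base_align(t);
    }
    if (std140)
        struct_align = align_up(struct_align, 16);
    out.size = align_up(offset, struct_align);
    return out;
}
```

For example, `{vec3, float, vec3, float}` packs into 32 bytes with the floats at offsets 12 and 28, filling each vec3's fourth slot. `{float, vec3}` also takes 32 bytes because the vec3 must start at 16. A `float[]` array has a 4-byte stride in std430 but a 16-byte stride in std140.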

It sounds a bit meh unless you read it thoroughly. I'd recommend 16-byte alignment if your structs are somewhat larger (otherwise, I'd consider using scalars). Of course, padding can cost you memory.

Hardware architecture design also comes into play, and there will be performance differences between, for example, RDNA, GCN, Turing, and Ada Lovelace.

JoeJ said:
I also use an SSBO for debugging: I write numbers to it from the shaders, then download them to the CPU to display them. Maybe that's something you could try now to get better clues.

I'm doing a lot more in the Direct3D 12 world than in Vulkan, so @joej's advice is going to be better when it comes to debugging. I don't know whether Vulkan has anything like PIX with RT support (RenderDoc does have some support, but its RT support is mostly non-existent for now). PIX, on the other hand, allows you to do a lot more investigation and helps debugging a lot.

You can view StructuredBuffers (and others) natively, just by defining a struct that describes how you want to view the memory blob. I know RenderDoc allows that too, but the last time I used it with RT, it crashed outright (about 6 weeks ago, if I remember correctly).

So what I usually do when debugging is declare intermediate read-write structured buffers and write temporary information into them, then run the whole app through PIX, capture frames, and investigate what happens.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

Vilem Otte said:
I don't know whether Vulkan has anything like PIX

When I worked with VK, Radeon GPU Profiler was not out yet, so I used CodeXL for profiling the OpenCL branch; the VK version was only a port of that.

CodeXL was great for compute. I hope current profiling tools can replace it well.
AFAIK, Nsight, Radeon GPU Profiler, and RenderDoc do support VK.

IMO, it is absolutely necessary to use such tools; otherwise you're totally blindfolded and can't know why stuff is slow. The earlier you get used to them, the more time you save. (I haven't really used them for debugging, only to see things like occupancy, register pressure, etc.)

VK/GL can do printf for debugging (in Vulkan GLSL via debugPrintfEXT from the GL_EXT_debug_printf extension), but I never tried it.
My use of a buffer to get some numbers back really was the bare minimum. There should be much better ways now.

Vilem Otte said:
I might be wrong here, but there should be a few alignment rules:

That's the ‘cumbersome’ stuff I remember from OpenGL. After the switch to VK I no longer needed it, and I was really happy about that.

But maybe that's just a coincidence. I don't use any structs; it's all SoA memory layout. So I probably avoided the hassle only because of that.
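For comparison, this is what the two layouts look like on the host side. In a structure-of-arrays (SoA) layout every field lives in its own tightly packed array (in a shader, one buffer or buffer range per field), so the only alignment to think about is that of a single scalar type. A common reason to prefer SoA on GPUs is that neighbouring threads then read neighbouring addresses. The ray fields and type names below are hypothetical:

```cpp
#include <cstddef>
#include <vector>

// Array of structures (AoS): one struct per ray, so struct padding rules matter.
struct RayAoS
{
    float origin[3];
    float tmin;
    float dir[3];
    float tmax;
};

// Structure of arrays (SoA): one tightly packed array per field.
struct RaysSoA
{
    std::vector<float> origin_x, origin_y, origin_z, tmin;
    std::vector<float> dir_x, dir_y, dir_z, tmax;

    explicit RaysSoA(std::size_t n)
        : origin_x(n), origin_y(n), origin_z(n), tmin(n),
          dir_x(n), dir_y(n), dir_z(n), tmax(n) {}

    // Ray i is spread across all arrays at the same index i.
    void set(std::size_t i, const RayAoS& r)
    {
        origin_x[i] = r.origin[0];
        origin_y[i] = r.origin[1];
        origin_z[i] = r.origin[2];
        tmin[i] = r.tmin;
        dir_x[i] = r.dir[0];
        dir_y[i] = r.dir[1];
        dir_z[i] = r.dir[2];
        tmax[i] = r.tmax;
    }
};
```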

JoeJ said:
VK/GL can do printf for debugging, but I never tried it.

From my old days, when CUDA was new and OpenCL was in its infancy: NOOOOOOOO!

Debugging code at that time was far from a pleasant experience.

JoeJ said:
AFAIK, Nsight, Radeon GPU Profiler, and RenderDoc do support VK.

The thing is, they support VK (and even D3D12), but they don't support RT (VKRT, DirectX Raytracing). The moment I started making RT calls, RenderDoc silently crashed without any message or information.

It took me quite a while to figure out what was happening back then. I've since switched to PIX.

It still lacks quite a few things for DirectX Raytracing, but at least it has some support for it. I'm quite sure RenderDoc v1.29 did NOT support any ray tracing. I can't speak for newer versions; the author has never officially said he is pursuing that, so it seems unlikely.

Note: I haven't used AMD's tools, so I can't speak for those either. Nsight's support for it is … well … meh. Sometimes it gives you some information, mostly not. I absolutely understand that debugging ray tracing poses massive challenges (having written many different ray tracers over time).

JoeJ said:
That's the ‘cumbersome’ stuff I remember from OpenGL.

Yeah, Vulkan adopted it (std140 and std430 are the same). On my side, in D3D, I still have to adhere to alignment rules too (which are in no way easier).

You generally create a raw view with DXGI_FORMAT_R32_TYPELESS or similar, set its NumElements to the required buffer size in bytes divided by 4 (because each element is 4 bytes), and work from there. It's much the same as in Vulkan.

And while alignment/padding isn't enforced beyond a few basic rules, for larger structures you do want them padded and aligned to 16 bytes. Why? Performance, at the cost of a bit of additional memory. The difference isn't small, either (it could be 20% or even more).

I often use structures like these:



/// <summary>
/// Single record in shadow map texture (virtual/atlas)
/// </summary>
struct __declspec(align(16)) ShadowTileRecord
{
	/// <summary>Shadow map texture projection matrix</summary>
	Engine::mat4 mMatrix;
	/// <summary>Texture coordinate origin (offset) in shadow map texture atlas</summary>
	Engine::float2 mOffset;
	/// <summary>Texture coordinate size in shadow map texture atlas</summary>
	Engine::float2 mSize;
};
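`__declspec(align(16))` is MSVC-specific; standard C++11 `alignas(16)` gives the same guarantee on other compilers. A self-contained sketch of the same record, with plain float arrays standing in for `Engine::mat4` and `Engine::float2` (an assumption, since the engine types are not shown), plus static_asserts that pin down the size and offsets the shader side is expected to read:

```cpp
#include <cstddef>

// Hypothetical stand-ins for Engine::mat4 / Engine::float2.
struct mat4   { float m[16]; };
struct float2 { float x, y; };

// Portable equivalent of __declspec(align(16)).
struct alignas(16) ShadowTileRecord
{
    mat4   mMatrix;  // shadow map texture projection matrix, bytes 0..63
    float2 mOffset;  // texture coordinate origin in the atlas, bytes 64..71
    float2 mSize;    // texture coordinate size in the atlas, bytes 72..79
};

static_assert(alignof(ShadowTileRecord) == 16, "16-byte aligned");
static_assert(sizeof(ShadowTileRecord) == 80, "5 x 16 bytes, no tail padding");
static_assert(offsetof(ShadowTileRecord, mOffset) == 64, "offset after matrix");
static_assert(offsetof(ShadowTileRecord, mSize) == 72, "size after offset");
```

An array of these records has an 80-byte stride, a multiple of 16. If a member were later removed, `alignas(16)` would still round the size up to the next multiple of 16, and the static_asserts would flag the change.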

Of course, you could also just dump it into a float4 buffer, or use three buffers: one for matrices, one for offsets, and one for sizes.


