Writing to an array of floats in a ray generation shader and accumulating the result in a compute shader

Started by
6 comments, last by FrEEzE2046 2 months, 2 weeks ago

Before I describe what I've tried, let me better describe what I'm actually trying to achieve:

I have two ray generation shaders, s1 and s2. The purpose of s1 is to compute n float values vals[0 … n - 1]. After all invocations of s1 have finished, s2 needs to know the sum acc = vals[0] + … + vals[n - 1].

I'm having a hard time getting this to work …

So far, I've tried declaring RWTexture1D<float4> vals : register(u0);. The problem is that n is usually around 100,000. When I try to create the corresponding resource, I get an error claiming that the width of a D3D12_UAV_DIMENSION_TEXTURE1D cannot be larger than 25,000. I've also tried replacing RWTexture1D with RWBuffer<float>, but the error is similar. So, part of my question is what I should do about this.

I don't really need to store all the vals. As I wrote above, I only need to know acc in the end. So, if necessary, I could compute multiple values in a single invocation of s1 and store their sum in vals. I could also call DispatchRays() with width = height = depth = 1, so that there is only one invocation of s1, and compute the whole sum in that invocation. However, that would be rather inefficient, since it would make no use of parallelism at all.
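
The "multiple values per invocation" idea can be sketched on the CPU as follows. This is only an illustration in plain C++, not shader code: partial_sums and chunk are hypothetical names, and each loop iteration over inv stands in for one invocation of s1 that sums its own chunk of values and writes a single partial sum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in: every simulated "invocation" sums `chunk` consecutive values
// and writes one partial sum, so only ceil(n / chunk) floats must be stored.
// `partial_sums` and `chunk` are illustrative names, not part of any API.
std::vector<float> partial_sums(const std::vector<float>& values, std::size_t chunk) {
    assert(chunk > 0);
    const std::size_t invocations = (values.size() + chunk - 1) / chunk;  // round up
    std::vector<float> out(invocations, 0.0f);
    for (std::size_t inv = 0; inv < invocations; ++inv) {
        const std::size_t begin = inv * chunk;
        const std::size_t end = std::min(begin + chunk, values.size());
        for (std::size_t i = begin; i < end; ++i) {
            out[inv] += values[i];
        }
    }
    return out;
}
```

With n = 100,000 and chunk = 4, this leaves 25,000 partial sums to store. Larger chunks shrink the buffer further but also reduce the number of parallel invocations.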

Anyways, assuming I managed to fill vals, I actually wondered how I can compute acc now. I clearly don't want to compute this in every single invocation of s2, so I thought I'd need a compute shader c1 for that. If there is a better option, I'm also curious to hear it. I'm quite a bit lost about how I should specify numthreads (I've never used compute shaders before).

Any help is highly appreciated!

Remark, in case it helps: s1 and s2 are not executed one after the other in every frame. s1 is only executed before s2 if something in the scene (or the camera position) has changed. That is, the value acc stays constant as long as nothing the camera sees has changed.

Why two shaders?

The accumulation in the ray gen shader is quite small.

	// Step two: this is the Fresnel reflection-refraction code
	// Start at the tips of the branches, work backwards to the root
	for(int i = current_buffer_index - 1; i >= 0; i--)
	{
		bool pure_refraction = false;
		bool pure_reflection = false;
		bool neither = false;
		bool both = false;

		if(rays[i].child_refract_id != -1 && rays[i].child_reflect_id == -1)
			pure_refraction = true;

		if(rays[i].child_refract_id == -1 && rays[i].child_reflect_id != -1)
			pure_reflection = true;

		if(rays[i].child_refract_id == -1 && rays[i].child_reflect_id == -1)
			neither = true;

		if(rays[i].child_refract_id != -1 && rays[i].child_reflect_id != -1)
			both = true;

		float accum = 0.0;

		if(neither)
		{
			accum = rays[i].base_color;
		}
		else if(both)
		{
			// Fake the Fresnel refraction-reflection
			const float ratio = 1.0 - dot(-normalize(rays[i].direction.xyz), rays[i].normal);

			float reflect_accum = mix(rays[i].base_color, rays[rays[i].child_reflect_id].accumulated_color, rays[i].reflection_constant);
			float refract_accum = mix(rays[i].base_color, rays[rays[i].child_refract_id].accumulated_color, 1.0 - rays[i].refraction_constant);
		
			accum = mix(refract_accum, reflect_accum, ratio);
		}
		else if(pure_refraction)
		{
			accum = mix(rays[i].base_color, rays[rays[i].child_refract_id].accumulated_color, 1.0 - rays[i].refraction_constant);	
		}
		else if(pure_reflection)
		{
			accum = mix(rays[i].base_color, rays[rays[i].child_reflect_id].accumulated_color, rays[i].reflection_constant);
		}


		// Do tinting
		const vec3 mask = hsv2rgb(vec3(hue, 1.0, 1.0));
		const float t = rays[i].tint_colour.r*mask.r + rays[i].tint_colour.g*mask.g + rays[i].tint_colour.b*mask.b;
	
		float x = accum;
		accum = mix(x, t, rays[i].tint_constant);
		accum = min(x, accum);


		rays[i].accumulated_color = accum;
	}

@taby The two ray generation shaders perform different tasks. If I performed the accumulation inside s1, wouldn't that be highly inefficient? There is one thing I may not have written clearly enough: I invoke s1 through a DispatchRays call with width = n and height = depth = 1. If I followed your suggestion, wouldn't I need to set width = 1 as well (so that there is only a single invocation of s1), since I would already generate all n values inside that single invocation?

FrEEzE2046 said:
Anyways, assuming I managed to fill vals, I actually wondered how I can compute acc now.

I know neither RT nor DX12, and actually I wonder why you have problems allocating a memory buffer large enough to store your results. There should be a way, so even without specific help I would research this topic further.
Oh, now I get it. You're trying to create an image that is larger than 25K x 25K?
That's of course a lot, and just too much. Why do you need this? Something like many AO samples per pixel?

However, you could do the accumulation with atomic operations, to reduce your memory from N x samplecount to just N.
Atomics to VRAM are expensive, but you would no longer need a CS to sum things up, so it may be a win.

NV supports floating-point atomics, but I guess AMD still does not.
I don't know whether DX exposes this feature, though; as far as I know it's an extension in the Khronos APIs.
So you might end up having to use integers, which usually works as long as we take care with precision.
64-bit atomics are possible as well on modern HW, but ideally you get away with 32 bits, of course.
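
The integer fallback can be sketched as fixed-point accumulation. The snippet below is a CPU illustration in plain C++ that uses std::atomic in place of a GPU atomic (in HLSL, the corresponding intrinsic on integer UAV data would be InterlockedAdd). The scale of 4096 (12 fractional bits), the restriction to non-negative values, and all function names are assumptions made for this example, not something stated in the thread.

```cpp
#include <atomic>
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumed fixed-point scale: 12 fractional bits. With about 100,000 values of
// at most 1.0 each, the total stays below 2^32 (100000 * 4096 = 409,600,000).
constexpr float kScale = 4096.0f;

// Convert a non-negative float to fixed point, rounding to nearest.
std::uint32_t to_fixed(float v) {
    assert(v >= 0.0f);
    return static_cast<std::uint32_t>(std::lround(v * kScale));
}

// Convert the accumulated fixed-point total back to a float.
float from_fixed(std::uint32_t f) {
    return static_cast<float>(f) / kScale;
}

// One "invocation" adds its value to the shared accumulator atomically.
void accumulate(std::atomic<std::uint32_t>& acc, float v) {
    acc.fetch_add(to_fixed(v), std::memory_order_relaxed);
}
```

The scale trades precision against range: more fractional bits resolve smaller contributions, but the total overflows sooner, which is where 64-bit atomics would help.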

FrEEzE2046 said:
I'm quite a bit lost about how I should specify numthreads (I've never used compute shaders before)

It should be at least 32 threads on NV and 32/64 on AMD. Otherwise, many threads of an SM/CU remain idle and do nothing.

Typical sizes are thus 64, 128, or 256. Larger workgroups of up to 1024 threads work as well, but they reduce occupancy, so they are used only where needed.

To find the best size, profiling all options is the best way, which may end up requiring different sizes for different HW.
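
To make the role of the group size concrete, here is a CPU simulation in plain C++ of a common reduction scheme. It is only a sketch under assumptions, not the actual compute shader: each group of group_size elements is summed with a halving-stride loop (in HLSL, the shared array would be groupshared memory with GroupMemoryBarrierWithGroupSync() between steps), and passes repeat until one value is left. The names group_count, reduce_pass and reduce_all are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of thread groups needed to cover n elements (rounded up).
std::size_t group_count(std::size_t n, std::size_t group_size) {
    assert(group_size > 0);
    return (n + group_size - 1) / group_size;
}

// One simulated dispatch: every group reduces its slice to a single value.
std::vector<float> reduce_pass(const std::vector<float>& in, std::size_t group_size) {
    // The halving loop below assumes a power-of-two group size of at least 2.
    assert(group_size >= 2 && (group_size & (group_size - 1)) == 0);
    const std::size_t groups = group_count(in.size(), group_size);
    std::vector<float> out(groups);
    std::vector<float> shared(group_size);  // stands in for group-shared memory
    for (std::size_t g = 0; g < groups; ++g) {
        for (std::size_t t = 0; t < group_size; ++t) {
            const std::size_t idx = g * group_size + t;
            shared[t] = idx < in.size() ? in[idx] : 0.0f;  // pad the last group
        }
        for (std::size_t stride = group_size / 2; stride > 0; stride /= 2) {
            for (std::size_t t = 0; t < stride; ++t) {
                shared[t] += shared[t + stride];
            }
        }
        out[g] = shared[0];
    }
    return out;
}

// Repeat passes until a single value, the total sum, remains.
float reduce_all(std::vector<float> data, std::size_t group_size) {
    if (data.empty()) return 0.0f;
    while (data.size() > 1) {
        data = reduce_pass(data, group_size);
    }
    return data[0];
}
```

For n = 100,000 and a group size of 256, the first pass uses 391 groups, the second 2, and the third 1. Whether 256 is actually the best choice still has to be profiled, as noted above.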

@JoeJ No, not larger than 25k x 25k; only larger than 25k x 1 (it's a 1-dimensional texture). I only need memory which can store 100k floats.

This is the error I'm receiving:

ID3D12Device::CreateUnorderedAccessView: The Dimensions of the View are invalid due to at least one of the following conditions. Assuming this Format (0x29, R32_FLOAT), FirstElement (value = 0) must be between 0 and the maximum offset of the Buffer, 24999, inclusively. With the current FirstElement, NumElements (value = 100001) must be between 1 and 25000, inclusively, in order that the View fit on the Buffer
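
The numbers in this message are consistent with a buffer of 100,000 bytes: R32_FLOAT takes 4 bytes per element, so such a buffer holds exactly 25,000 elements. One possible reading (an assumption, the thread does not confirm it) is that the buffer size was given as an element count, whereas the Width of a D3D12 buffer resource is specified in bytes. The arithmetic can be checked with a few lines of plain C++; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// DXGI_FORMAT_R32_FLOAT occupies 4 bytes per element.
constexpr std::uint64_t kBytesPerElement = 4;

// Bytes a buffer needs in order to hold `num_elements` floats.
std::uint64_t required_bytes(std::uint64_t num_elements) {
    return num_elements * kBytesPerElement;
}

// Number of whole R32_FLOAT elements that fit into `buffer_bytes`.
std::uint64_t max_elements(std::uint64_t buffer_bytes) {
    return buffer_bytes / kBytesPerElement;
}

// Mirrors the range check described in the error message.
bool view_fits(std::uint64_t buffer_bytes, std::uint64_t first_element,
               std::uint64_t num_elements) {
    assert(num_elements > 0);
    return first_element + num_elements <= max_elements(buffer_bytes);
}
```

Under that assumption, a view with NumElements = 100001 would need a buffer of at least 400,004 bytes.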

FrEEzE2046 said:
I only need memory which can store 100k floats.

Oh, ok. Maybe you can use some other kind of memory buffer.
With Khronos, there are just textures or plain memory, but DX seems more varied and confusing here.

I'm pretty sure I already used larger memory buffers with Vulkan 10 years ago, and it was no problem.
So this should be possible for you too. Maybe somebody will reply with better help…

The solution for me was to use CreateBuffer instead of CreateCommittedResource. If someone can explain why this is necessary (and how they differ internally), I'd be highly interested.

This topic is closed to new replies.