Vulkan bidirectional path tracer


I'm writing a bidirectional path tracer using Vulkan and C++/GLSL.

The code and model file are at: https://github.com/sjhalayka/bidirectional_path_tracer

As you can see, there is a spurious light source from outside of the box. It's not supposed to be like that.

Anyone with experience in bidirectional path tracing?


You probably have the ray return white (vec3(1.0)) when it doesn't hit anything. This adds illumination to the scene. You should make it return black (vec3(0.0)) instead.

The problem is here:

All of these cases are wrong. The first one should return the color of the sky (if it is treated as a light) multiplied by local_colour, or return black if you want sky to be black (to remove the light outside the box). The second one should return black because the path is not complete (same result as if it didn't hit anything). The third one should return local_colour multiplied with the light color and intensity. The way you are detecting hitting a light is hella sketchy.
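For illustration, here is a minimal sketch of how those cases could be handled so that nothing outside the box injects light. This is not the repo's code - it assumes the same topLevelAS / rayPayload declarations as raygen.rgen, and the is_light flag is a made-up stand-in for however the emitter ends up being detected:

// Sketch only. Handles the three cases described above.
float shade_bounce(const vec3 o, const vec3 d, const vec3 mask, const float local_colour)
{
	traceRayEXT(topLevelAS, gl_RayFlagsOpaqueEXT, 0xff, 0, 0, 0, o, 0.001, d, 10000.0, 0);

	// Miss: a black sky adds no illumination. Only return a sky colour here
	// if the sky is meant to act as a light source.
	if(rayPayload.dist == -1.0)
		return 0.0;

	// Hit an emitter: the path is complete, so return the carried throughput
	// weighted by the light's colour and intensity.
	if(rayPayload.is_light)
		return local_colour * (rayPayload.color.r*mask.r + rayPayload.color.g*mask.g + rayPayload.color.b*mask.b);

	// Hit a non-emissive surface with no bounces left: the path is incomplete,
	// so it contributes nothing (same result as a miss).
	return 0.0;
}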

Aressera said:
The way you are detecting hitting a light is hella sketchy.

If you look at closesthit.rchit, it makes sense why it is compared this way. But it is weird to me (my bet is that it's just a quick debug setup for a light with an intensity of 20).
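Agreed, though. A sturdier alternative (just a sketch, not the repo's code) is to have closesthit.rchit tag emitters explicitly, so the raygen shader never compares the returned colour against the hard-coded 20.0:

// Hypothetical payload extension: the closest-hit shader sets hit_light for
// emissive triangles.
struct RayPayload
{
	vec3 color;
	float dist;
	vec3 normal;
	bool hit_light;
};

// In closesthit.rchit (the emission lookup is assumed, not taken from the repo):
//     rayPayload.hit_light = any(greaterThan(material_emission, vec3(0.0)));

// In raygen.rgen the emitter test then simply becomes:
//     if(rayPayload.hit_light) { ... }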

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

Actually, you're looking at the wrong part of the code altogether. The code that you have a problem with works just fine – it's the forward/bidirectional code that doesn't work. I should have specified which line earlier.

The problem function is at: https://github.com/sjhalayka/bidirectional_path_tracer/blob/cb3069945a1feb4bade359a990de7c43a3770d4a/raygen.rgen#L501

Thanks for your time, in any case.

Can't help with the issue, but reading the code gave me an optimization idea.

I see you use an array to store path vertices from a light.
That's bad on the GPU, because the array lives in registers, not memory, so addressing by index is not directly possible. Instead, the GPU has to process multiple branches to find the register matching the index, so indexing a local array is not O(1).
The branching causes execution divergence, but worse: storing the array in registers causes register pressure, hurting the performance of all other shaders as well.

The usual solution is to use LDS memory instead of registers. Maybe the compiler even does this, but I doubt it, and that's not my point.
To use LDS memory, you have to use a compute shader instead of a ray generation shader. And by tracing from compute, you may lose specific HWRT optimizations of NV and Intel GPUs (binning / sorting rays to saturate hit shaders), while on AMD it's often faster, IIRC.

That said, here's my idea, which - if it works - should give a bigger win than the concerns above.
Say we use a workgroup of 256 threads / rays, formed by a 16*16 block of pixels.
The pixels are likely close to each other, with similar normals, so it's also likely that they see similar parts of the scene.
We store a light path for each pixel in LDS memory. This means that each ray has access to all the paths of the workgroup.
So you could share a lot of work to get many more samples, despite the need to trace one ray to each vertex we might connect to.

Not sure if that's a new idea, but it sounds worth investigating.
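To make the idea concrete, here is a rough sketch of the workgroup layout I have in mind - only the LDS storage and the synchronization, with the actual tracing and connection logic left as placeholder comments:

#version 460
// GL_EXT_ray_query would be needed once the actual tracing from compute is added.

layout(local_size_x = 16, local_size_y = 16) in;

const uint MAX_LIGHT_VERTICES = 4u;

// One light path per pixel of the 16*16 tile, stored in LDS so that every
// thread can connect its eye path to any of the 256 stored light paths.
// Note: this alone is roughly 25 KB of LDS, so the vertex count per path is
// the limiting factor.
shared vec3 light_path_pos[16 * 16][MAX_LIGHT_VERTICES];
shared vec3 light_path_normal[16 * 16][MAX_LIGHT_VERTICES];
shared uint light_path_len[16 * 16];

void main()
{
	const uint local_index = gl_LocalInvocationIndex;

	// 1. Each thread traces its own light path and writes the vertices to LDS:
	//    light_path_len[local_index] = trace_light_path(local_index, ...);
	light_path_len[local_index] = 0u;

	barrier(); // make the whole tile's light paths visible to every thread

	// 2. Each thread traces its eye path and tries to connect each eye vertex
	//    to the vertices of any light path in the tile:
	//    for(uint p = 0u; p < 16u * 16u; p++)
	//        for(uint v = 0u; v < light_path_len[p]; v++)
	//            connect(eye_vertex, light_path_pos[p][v], light_path_normal[p][v]);
}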

JoeJ said:
I see you use an array to store path vertices from a light. That's bad on the GPU, because the array lives in registers, not memory, so addressing by index is not directly possible. Instead, the GPU has to process multiple branches to find the register matching the index, so indexing a local array is not O(1). The branching causes execution divergence, but worse: storing the array in registers causes register pressure, hurting the performance of all other shaders as well.

Having flashbacks of my thesis.

Over time, and across different architectures, I've tried multiple approaches (keep in mind, most of that was in compute; I never integrated Vulkan RT into our projects, although DXR is heavily WIP currently - so far it seems to end up in some increase in performance, though not that significant). Anyway, back to the topic.

There are multiple approaches you can take when doing bidirectional path tracing - you have to trace a path from the light and also from the eye. This is followed by merging the paths (which can be done at the endpoints, or even explicitly between vertices of the paths). The paths can be generated within a single pass or with a multi-pass approach. The integration can also utilize Metropolis-Hastings, which often increases the early convergence rate. A few notes:

  • Tracing both paths within a single pass often causes problems (some threads take significantly longer than others … this also applies to each path separately; you can either mitigate this with a persistent-threads approach, or iteratively create just a single path step per pass, storing the results in buffers) - this applies double if you do an unbiased version with Russian roulette (see the sketch after this list)!
  • Joining both paths with a single step generally keeps occupancy higher, but the convergence rate is often significantly lower. Joining with multiple steps results in worse occupancy. A persistent-threads approach does help here. The downside of more complex joining is that you often need to store results in structured buffers (LDS could work under some constraints).
  • For optimization it is possible to pre-generate light paths (you need a way to somehow request new ones - assuming you want an unbiased version). Keep in mind though that dynamic objects and lights will require re-generation of the impacted light paths. This can improve computation time significantly. Also, it is possible to extend this with Metropolis, which (often) significantly improves the early convergence rate.
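Regarding the Russian roulette mentioned above, here is a minimal sketch of the unbiased termination step (the throughput and RNG names are placeholders, not code from the repo):

// Terminate the path with a probability based on its current throughput, and
// re-weight the survivors so the estimator stays unbiased in expectation.
bool russian_roulette(inout vec3 throughput, const float rng_value)
{
	// Survival probability: keep bright paths, kill dim ones early.
	const float p = clamp(max(throughput.r, max(throughput.g, throughput.b)), 0.05, 0.95);

	if(rng_value > p)
		return true; // terminate the path

	throughput /= p; // compensate the paths that survive
	return false;
}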

JoeJ said:
The usual solution is to use LDS memory instead of registers. Maybe the compiler even does this, but I doubt it, and that's not my point.

I won't go into a rant against compute compilers… on the topic of compiler “optimization”, let me just share this part of my code (from traversal … I've looked at the generated assembly, and it's “almost” the same - maybe I could send this to AMD):

	// Traversal
	[loop] for (i = 0; i < 1000; i++)
	//while (node_id != 0xFFFFFFFF) // This is significantly slower (factor of 10+) on RDNA/RDNA 2, not properly tested on NV and Intel
	{
		...
		
		if (node_id == 0xFFFFFFFF)
		{
			break;
		}
	}

So, no, I wouldn't trust the compiler to do LDS optimizations properly when it can't even compile loops optimally (I know it's a harsh statement, but I almost went bald from this loop). This being said - an old OpenCL kernel doing the while loop and the for loop yields the same performance.

JoeJ said:
That said, here's my idea, which - if it works - should give a bigger win than the concerns above. Say we use a workgroup of 256 threads / rays, formed by a 16*16 block of pixels. The pixels are likely close to each other, with similar normals, so it's also likely that they see similar parts of the scene. We store a light path for each pixel in LDS memory. This means that each ray has access to all the paths of the workgroup. So you could share a lot of work to get many more samples, despite the need to trace one ray to each vertex we might connect to.

The coherence of a 16x16 block of pixels ends after the primary ray. It is path tracing after all - the rays will go randomly (well… importance sampling plays a role there) and also terminate at random lengths (unbiased - therefore Russian roulette). That counts for BOTH the light path and the eye path. The only way you can guarantee coherence (which I did in my thesis) is:

  1. Generate rays for a step
  2. Sort the rays in a way that prioritizes ray direction and proximity of origin (the rays have to store which pixel they map to!)
  3. Perform traversal for the rays and update the results (explicit step), then perform Russian roulette
  4. Continue with step 1 (assuming there are still non-terminated paths … you can introduce bias via a maximum number of passes)

This is performed for both light and eye paths (the light path results are the vertices of the given light path).
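For reference, a rough sketch of the per-ray state such a multi-pass scheme has to keep in a buffer between passes (all of the field names are made up for illustration):

// Per-ray state kept in an SSBO between passes, so that a path can be advanced
// one step per dispatch and sorted / compacted in between.
struct PathState
{
	vec3  origin;
	uint  pixel_index; // which pixel this path maps to
	vec3  direction;
	uint  bounce;      // current path length
	vec3  throughput;  // accumulated BRDF * cos / pdf product
	uint  alive;       // 0 once Russian roulette or a miss terminates the path
};

layout(std430, binding = 1) buffer PathStates
{
	PathState paths[];
};

Each pass reads paths[], advances every live path by one step, and the sort in step 2 reorders these structs by direction and origin.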

What you suggest is to trace light paths within a “tile” (16x16 block) and then re-use all of the light paths when the same block (within the same kernel!) traces the eye paths. This is possible - it will be a mega-kernel-like approach and occupancy will be lower (paths have varying lengths!). What could improve occupancy though (maybe an idea) - the tile creates the light paths, and instead of syncing, the threads that finish early could create another light path … similar to the speculative while-while traversal approach from Aila/Laine back in 2009, and what they did with postponing a leaf node in the BVH and searching for another one. You could do the same for the eye paths. This would improve occupancy. I'm probably going quite far into the details here though - and it's just an idea that may or may not work … sadly I don't have time to try it right now.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

Hey JoeJ, thanks for your time and expertise.

The only reason that I did things this way is because there is no recursion in Vulkan/GLSL.

taby said:
The only reason that I did things this way is because there is no recursion in Vulkan/GLSL.

You can simulate recursion with a stack. That's not a problem (it's the common way to do a traversal through a tree).
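For example, a tree traversal written with an explicit stack instead of recursion looks something like this (a generic sketch, not code from the repo; the leaf test and child fetches are placeholders):

const uint INVALID_NODE = 0xFFFFFFFFu;

void traverse(const uint root)
{
	uint stack[32];
	int stack_ptr = 0;

	stack[stack_ptr++] = root;

	while(stack_ptr > 0)
	{
		const uint node_id = stack[--stack_ptr];

		if(node_id == INVALID_NODE)
			continue;

		// if(is_leaf(node_id))
		//     { /* intersect the primitives */ }
		// else
		//     { stack[stack_ptr++] = left_child(node_id);
		//       stack[stack_ptr++] = right_child(node_id); }
	}
}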

Anyways - to your question … let me copy the code here and add comments with questions (because it doesn't make that much sense to me):

// This function takes in - eye is camera position, direction is primary ray direction, correct? What's the purpose of this function? Perform eye-path path trace? Perform light-path path trace? Or?
//
// Digging through this - this seems to do whole bidirectional path...
float trace_path_forward(const vec3 eye, const vec3 direction, const float hue, const float eta)
{
	// The following part selects a random light, then selects a random point on light and then selects random direction (cosine-weighted) ... is this correct?
	// Why is this even done here? This makes sense only for light path...
	const vec3 mask = hsv2rgb(vec3(hue, 1.0, 1.0));

	Vertex A, B, C;

	int index = int(round(stepAndOutputRNGFloat(prng_state)*float(ubo.light_tri_count - 1)));

	get_triangle_vertices_by_light_index(A, B, C, index);

	const vec3 light_o = getRandomPointOnTriangle(A.pos, B.pos, C.pos);
	const vec3 light_d = cosWeightedRandomHemisphereDirection(get_normal_by_light_index(index), prng_state);

	float area = get_area_by_light_index(index);

	const float energy = area;


	
	// So... here you start with an eye path ... step_locations should contain vertices along eye path, correct?
	// If we hit light, we return - otherwise we continue in path.
	uint step_count = 0;
	vec3 step_locations[max_bounces + 3];

	step_locations[step_count] = eye;
	step_count++;

	traceRayEXT(topLevelAS, gl_RayFlagsOpaqueEXT, 0xff, 0, 0, 0, eye, 0.001, direction, 10000.0, 0);

	if(rayPayload.dist != -1.0)
	{
		vec3 hitPos = eye + direction * rayPayload.dist;

		step_locations[step_count] = hitPos;
		step_count++;

		if(	rayPayload.color.r == 20.0 && 	
			rayPayload.color.g == 20.0 && 
			rayPayload.color.b == 20.0)
		{
			float local_colour = (rayPayload.color.r*mask.r + rayPayload.color.g*mask.g + rayPayload.color.b*mask.b);
			return local_colour;
		}
	}
	else
	{
		// If hit the sky
		return 0.0;
	}

	// Wait ... what is this. So - now you build the first vertex on light path (which is point on the light)
	//
	// Then you attempt to perform a merge between end of eye-path and sampled point on light path (i.e. like simple explicit sample). 
	// And therefore connect light and eye paths.
	//
	// If connection is found - you continue?
	// If no connection is found - you iteratively bounce light path and test if there is connection?
	bool found_path = false;

	step_locations[step_count] = light_o;
	step_count++;

	if(true == is_clear_line_of_sight(step_locations[1], light_o))
	{
		found_path = true;
	}
	else
	{
		vec3 o = light_o;
		vec3 d = light_d;

		for(int i = 0; i < max_bounces; i++)
		{
			traceRayEXT(topLevelAS, gl_RayFlagsOpaqueEXT, 0xff, 0, 0, 0, o, 0.001, d, 10000.0, 0);

			// If hit the sky
			if(rayPayload.dist == -1.0)
				return 0.0;

			vec3 hitPos = o + d * rayPayload.dist;

			step_locations[step_count] = hitPos;
			step_count++;

			if(true == is_clear_line_of_sight(step_locations[1], step_locations[step_count - 1]))
			{
				found_path = true;
				break;
			}

			o = hitPos + rayPayload.normal * 0.01;
			d = cosWeightedRandomHemisphereDirection(rayPayload.normal, prng_state);
		}
	}

	// Uhm... in case you can't connect eye path and light path ... you continue tracing eye path? Returning its color directly?
	//
	// This doesn't make sense to me, why would you even do that (with BRDF being different! and scattering (while no scattering happens here!!!))? 
	// How do you expect to hit a tiny light in this case when you can't find connection with any vertex on the light path?
	//
	// I don't think this should be here
	if(found_path == false)
		return trace_path_backward(max_bounces, eye, direction, hue, eta);



	// ???
	//
	// Your buffer is laid in a way:
	// [eye, 1st eye-path vertex, 1st light path vertex, 2nd light path vertex, ...]
	//
	// you reverse the light path here - correct?

	// Reverse the last part of the path
	uint start = 2;
	uint end = step_count - 1;

	while(start < end) 
	{ 
		vec3 temp = step_locations[start];  
		step_locations[start] = step_locations[end]; 
		step_locations[end] = temp; 
		start++; 
		end--; 
	}


	// So... at this point you have the eye path (just the origin and a single vertex) and a light path that surely connects with the first vertex on the eye path (endpoint)
	//
	// So at this point you perform merge - why are you re-tracing rays between points on eye path and light path. You have to merge paths, not re-trace them.
	// Also, you already performed endpoint connection in is_clear_line_of_sight - why tracing ray again, that is expensive operation
	//
	// Merging paths is the tricky part - you can either merge only endpoints (I'd start with that one as it's easy - it's basically the same as doing explicit
	// step in standard path tracing), then start looking at how to merge multiple vertices

	float ret_colour = 0;
	float local_colour = energy;
	float total = 0;


	for(int i = 0; i < step_count - 1; i++)
	{
		vec3 step_o = step_locations[i];
		vec3 step_d = step_locations[i + 1] - step_locations[i];

		traceRayEXT(topLevelAS, gl_RayFlagsOpaqueEXT, 0xff, 0, 0, 0, step_o, 0.001, step_d, 10000.0, 0);

		local_colour *= (rayPayload.color.r*mask.r + rayPayload.color.g*mask.g + rayPayload.color.b*mask.b);
		
		total += mask.r;
		total += mask.g;
		total += mask.b;

		// If hit a light
		if(	rayPayload.color.r == 20.0 && 	
			rayPayload.color.g == 20.0 && 
			rayPayload.color.b == 20.0)
		{
			ret_colour += local_colour;
			break;
		}
	}

	return ret_colour / total;
}
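To illustrate the “merge only the endpoints” suggestion from the comments above, here is a sketch of what the connection contribution could look like - a standard next-event-estimation style term instead of re-tracing the stored path. The parameter names are made up; is_clear_line_of_sight is the helper already used in the shader above:

// Connect the last eye-path vertex to a sampled point on a light, weighting by
// the geometry term. One shadow ray decides visibility - the stored path is
// not traced again.
float connect_to_light(const vec3 eye_pos, const vec3 eye_normal,
	const vec3 light_pos, const vec3 light_normal,
	const float light_area, const float throughput)
{
	vec3 to_light = light_pos - eye_pos;
	const float dist2 = dot(to_light, to_light);
	to_light = normalize(to_light);

	// Geometry term: cosines at both endpoints over the squared distance.
	const float cos_eye = max(dot(eye_normal, to_light), 0.0);
	const float cos_light = max(dot(light_normal, -to_light), 0.0);
	const float G = (cos_eye * cos_light) / max(dist2, 1e-6);

	if(false == is_clear_line_of_sight(eye_pos, light_pos))
		return 0.0;

	// The light is sampled by area, so the pdf is 1 / light_area, hence the
	// multiplication by the area.
	return throughput * G * light_area;
}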

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

taby said:
The only reason that I did things this way is because there is no recursion in Vulkan/GLSL.

Oh, how interesting. I can live with that. Recursion is for noobs. :D

Vilem Otte said:
it will be a mega-kernel like approach and occupancy will be lower (paths have varying lengths!).

I need to guess…
You say Russian roulette is needed to avoid bias. Is this because RR could - in theory(?) - generate paths of unbounded length to get infinite bounces? Can't be. (I'll never know the precise meaning of ‘biased’ in PT : )
Let's say we support 4 bounces, so there's no need for RR. Further assuming each traceRay call takes a similar amount of time, threads should not wait on each other for too long.

Vilem Otte said:
The coherence of a 16x16 block of pixels ends after the primary ray. It is path tracing after all - the rays will go randomly

This applies to individual samples, but not to the entire signal we try to integrate. We can imagine the signal as an environment map seen from the cluster of our pixels with similar normals.
The light paths give us some information about this environment. A path that is good for a single pixel is likely good for all pixels in the cluster. That's what I meant, not the other problem caused by the random eye path after the first hit.

Sounds like you have tried this idea, but with a larger number of light paths stored in VRAM. At a larger scale, it reminds me of Photon Mapping a bit.

I guess the problem with my proposal is: while it may be faster, it needs too much base performance to make sense for real time.
In Taby's example there are two paths per pixel, which is already a big ray budget. If every pixel needs to trace one more ray to 63 other light path vertices, or even #bounces times 63, that's surely too much. : )

But how is it if we do only one light path, eventually considering our cluster center and normal cone to get a better contribution, and then we use that one path for all of our 64 pixels?
Ideally there's a need to store it in VRAM then, so I guess that's somehow what you did?
Randomly shifting the tile grid each frame would help against temporal blockiness, I would hope, and there is also no need for compute shaders.

JoeJ said:

taby said:
The only reason that I did things this way is because there is no recursion in Vulkan/GLSL.

Oh, how interesting. I can live with that. Recursion is for noobs. :D

Now, that's not what I meant! LOL ; )

