SVOGI Implementation Details


Vilem Otte said:
Sorry, I had to!

Your ‘real time’ (meaning ‘interactive’) path tracer will commit suicide after it sees my (truly) realtime surfels outperforming it on a lame iGPU! >:D

But I'll lack high-frequency details, so I really would like to play with HW RT as well. If I could. Which I cannot, because those amateurs think static-topology meshes will be good enough for games forever.
I want fine-grained LOD somewhat similar to Nanite, which DXR cannot handle at all as long as the BVH is black-boxed.
And why should I have to rebuild a BVH every time I stream something in? I already have a BVH built for RT, which I load from disk. So I could just convert my streamed BVH to the HW format, which would be much faster.

My rant shall not end until this embarrassing failure of a shitty API has been fixed.


@joej I'll counter that it is indeed realtime on big guns like a Radeon 6800, Radeon 590, GeForce 2070 or GeForce 3070 … although those are by no means iGPUs!

I wanted to play with HW RT, and I do have the hardware for it. I looked up the examples, but I've never integrated it into my engine - I dislike the API.

You see, in my API I chose a different approach - I generally hide things from the user (unless they choose to dig deeper into the API - but at that point you can always start looking at the source of the actual library).

The Scene contains a huge buffer for geometry, which is always hidden from the user. You ask the scene object for space from this buffer (such an allocation is what I call a Node) and flag whether the node is Static or Dynamic - this is how the user fills the scene with geometry. Like:

OpenTracer::Node* node = scene->CreateNode(mesh.GetSize(), OpenTracer::Node::Type::DYNAMIC); // reserve space in the scene's geometry buffer
node->SetData(mesh.GetData(), mesh.GetSize()); // upload the mesh data into that node

There is another object called Aggregate (or acceleration structure, if you wish). It has to be configured and its type has to be stated. It also has to be linked to the scene, like:

OpenTracer::Aggregate* as = new OpenTracer::Aggregate(OpenTracer::Aggregate::AGGREGATE_MULTILEVELBVH, scene, "MultiBVH.conf"); // structure type, the scene it indexes, and a configuration file

Note: It's not a member of the scene, because you can have different acceleration structures on a single scene - this was mainly done for testing purposes. Not to mention there are multiple acceleration structure types implemented (I'll only describe the multi-level one further, as it's the only one relevant for dynamic scenes). Upon instantiation it creates a list of bottom-level acceleration structures, one for each node. Those marked static are built only once and never rebuilt automatically (there is an option to load them from file); the ones marked dynamic are rebuilt each time you call Refresh on the Aggregate object. The top-level acceleration structure is rebuilt each frame. (Fun fact: in a CPU context you could mix acceleration structure types - like a top-level BVH with bottom-level KD-trees - while in a GPU context you'd need a separate kernel for that.)
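
To make that behavior concrete, here is a minimal sketch of what a Refresh() call boils down to. The types and helpers below are hypothetical stand-ins for illustration, not the actual OpenTracer code:

#include <vector>

// Hypothetical stand-ins for illustration only - not the real OpenTracer types.
struct Node  { bool dynamic; /* geometry range, bottom-level BVH handle, ... */ };
struct Scene { std::vector<Node> nodes; };

struct Aggregate
{
    Scene* scene = nullptr;

    void BuildBottomLevel(Node&) { /* (re)build this node's bottom-level BVH */ }
    void BuildTopLevel()         { /* build the small BVH over all node bounds */ }

    // What a Refresh() call boils down to, per the description above.
    void Refresh()
    {
        for (Node& n : scene->nodes)
            if (n.dynamic)           // static nodes were built once (or loaded from file)
                BuildBottomLevel(n); // dynamic nodes are rebuilt on every Refresh()

        BuildTopLevel();             // the top level is cheap, so it is rebuilt every frame
    }
};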

The implementation is built around the idea that the user doesn't need to pay attention to acceleration structures at all. You literally work like this:

while (running)
{
	...
	
	as->Refresh();                              // acceleration structures refreshed internally (dynamic nodes rebuilt)
	raygen->GeneratePrimary(......);            // generate primary rays
	renderer->Render(scene, as, raygen, image); // trace and shade into the output image
}

You don't need to rebuild acceleration structures at any point, you don't need to build them manually when objects are added, and you don't need to refresh them explicitly per object - internally the API takes care of all of it. This is almost entirely hidden from the user.

The main downside right now is that this whole library and API uses OpenCL for hardware rendering, while my game engine uses Direct3D 11. I've been considering for quite some time porting it over to allow better interoperation (like streaming out from a geometry shader directly into the memory reserved for a node). The main reason I haven't is probably that I'm way too lazy and way too occupied with other projects. I still have a note in my todo list to do this over a few weekend evenings.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

Vilem Otte said:
You don't need to rebuild acceleration structures at any point, you don't need to build them manually when objects are added, and you don't need to refresh them explicitly per object - internally the API takes care of all of it. This is almost entirely hidden from the user.

Basically that's true for DXR as well. Maybe yours provides more abstraction by even doing the rebuild automatically, but it's exactly such abstractions that prevent me from using it.
And I did not even see that coming initially, after the RTX announcement and looking up the first release of DXR. I criticized RTX from the start because I felt it prevents us from doing further research to come up with efficient raytracing. Only NVidia could do that from now on, since they build the fixed-function units and nobody else has access. Thus nobody has a reason to work on raytracing anymore, and progress would stagnate. (You do it anyway, which is crazy, but you have earned my respect ;D )

Only a bit later did I realize I was actually right. But the problem sits way deeper than being blocked from solving the random memory access problem of a traceRay function…

Years before, people often said: ‘Raytracing is not realtime, because building the acceleration structure is too expensive for complex scenes. You can't do that each frame if the scene is dynamic.’
Then I replied: ‘Bullshit, dude! This can be solved easily: you prebuild a BVH for each of your models offline. At runtime you do this per frame: refit the BVH for characters, then build a small BVH over all your model instances. There won't be many of those, so building that top level is little work and totally realtime.'
And this is exactly how DXR deals with the problem as well; a rough sketch of that pattern in DXR terms follows below.
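
This is a sketch only: resource creation, prebuild-info queries and barriers are omitted, and all buffer addresses are assumed to exist already. The detail that matters for the rest of this post is that the PERFORM_UPDATE (refit) path is only valid while the geometry's topology and vertex counts stay exactly as they were at build time.

#include <d3d12.h>

// Sketch: refit a deformable character BLAS, then rebuild the small TLAS, every frame.
void RefreshAccelerationStructures(ID3D12GraphicsCommandList4* cmd,
                                   const D3D12_RAYTRACING_GEOMETRY_DESC& skinnedGeometry,
                                   D3D12_GPU_VIRTUAL_ADDRESS blas,
                                   D3D12_GPU_VIRTUAL_ADDRESS blasScratch,
                                   D3D12_GPU_VIRTUAL_ADDRESS tlas,
                                   D3D12_GPU_VIRTUAL_ADDRESS tlasScratch,
                                   D3D12_GPU_VIRTUAL_ADDRESS instanceDescs,
                                   UINT instanceCount)
{
    // Refit (update) the character BLAS: fast, but only legal while topology and
    // vertex counts stay identical to the original build.
    D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC blasDesc = {};
    blasDesc.Inputs.Type           = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_TYPE_BOTTOM_LEVEL;
    blasDesc.Inputs.Flags          = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_ALLOW_UPDATE |
                                     D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_PERFORM_UPDATE;
    blasDesc.Inputs.DescsLayout    = D3D12_ELEMENTS_LAYOUT_ARRAY;
    blasDesc.Inputs.NumDescs       = 1;
    blasDesc.Inputs.pGeometryDescs = &skinnedGeometry;
    blasDesc.SourceAccelerationStructureData  = blas;   // refit in place
    blasDesc.DestAccelerationStructureData    = blas;
    blasDesc.ScratchAccelerationStructureData = blasScratch;
    cmd->BuildRaytracingAccelerationStructure(&blasDesc, 0, nullptr);

    // Rebuild the TLAS from scratch: cheap, since there is only one entry per instance.
    D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC tlasDesc = {};
    tlasDesc.Inputs.Type          = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_TYPE_TOP_LEVEL;
    tlasDesc.Inputs.Flags         = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_PREFER_FAST_TRACE;
    tlasDesc.Inputs.DescsLayout   = D3D12_ELEMENTS_LAYOUT_ARRAY;
    tlasDesc.Inputs.NumDescs      = instanceCount;
    tlasDesc.Inputs.InstanceDescs = instanceDescs;
    tlasDesc.DestAccelerationStructureData    = tlas;
    tlasDesc.ScratchAccelerationStructureData = tlasScratch;
    cmd->BuildRaytracingAccelerationStructure(&tlasDesc, 0, nullptr);
}
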
Nothing wrong with that? Actually nice? No, it's not enough. We forgot to think about LOD, which, besides visibility, is one of the two key problems in computer graphics.
And we have good reasons to have ignored LOD, because we have a rich history of failing to solve it. Actually, that's caused by GPUs. Shiny had their ‘Nanite’ already back in 2000, and we had dynamic LOD algorithms for terrain as well. But it was not efficient to upload geometry to the GPU each frame. It was faster to be silly and lazy: use more triangles than needed, upload them only once, and let the GPU brute-force the problem away.
So we stopped working on LOD at all. We accepted the non-solution of discrete LODs as good enough. And decades later, Epic proved us all fools by shockingly showing off what we had been missing without even noticing.

At this point we should realize that we took the bait: GPUs lured us in the wrong direction.
Graphics programmers were once known to be optimization experts, creative and innovative, never hesitating to work on hard open problems like hidden surface removal.
But then, after GPUs? No more creation of portals while rendering front to back as seen in Quake, no more bump mapping as seen in Outcast, no more progressive LOD as seen in Messiah.
We lost the ability to tackle open problems and became close-to-the-metal, low-level optimizers. To keep up with progress, all we did was read the latest NVidia papers, which taught us how to do brute force most efficiently.
We took the bait and gratefully accepted that the only way forward is fixed function and ever-increasing teraflop numbers - helping them sell bigger, bigger, and even bigger GPUs.

And gamers took the bait too. They believed in the graphics gods at NV just as much as we did. And to afford bigger and bigger GPUs, they became miners.

What a sad story, no?
It's the fucking truth.

But now back to my problem: proving why our awesome DXR prevents us from solving the LOD problem.
Let's take Nanite as an example.
Nanite has a BVH for its meshes, because LOD is obviously a problem of hierarchical detail. Refining detail means descending a branch of the tree; reducing detail means stopping at an internal node. Each node stores a patch of geometry at a certain level of detail (a small sketch of such a cut selection follows at the end of this paragraph).
Do they calculate this hierarchy on loading the highest-resolution model? As a nicely abstracted background process? Taking just a minute of processing time? Just like DXR does for its BVH?
No. Of course they precompute this and load from disk only what they actually need, without any background processing. They're not stupid.
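
As promised, a minimal sketch of selecting such a cut through a hierarchy of geometry patches. This is a generic illustration with assumed types and a deliberately crude error metric, not Epic's actual code:

#include <cmath>
#include <vector>

// Hypothetical node of a LOD hierarchy: each cluster stores a patch of geometry at
// some level of detail, with finer-detail children covering the same surface.
struct Cluster
{
    std::vector<Cluster*> children;
    float geometricError;   // object-space error of drawing this patch instead of its children
    float boundsCenter[3];  // rough position, used for distance estimation
    /* vertex/index ranges ... */
};

// Crude screen-space error estimate: object-space error shrinks with distance.
// A real implementation would project a bounding sphere through the camera.
static float ScreenSpaceError(const Cluster& c, const float viewer[3])
{
    float dx = c.boundsCenter[0] - viewer[0];
    float dy = c.boundsCenter[1] - viewer[1];
    float dz = c.boundsCenter[2] - viewer[2];
    float distance = std::sqrt(dx * dx + dy * dy + dz * dz) + 1e-6f;
    return c.geometricError / distance;
}

// Select the cut: stop at an internal node where its patch is good enough for the
// current view, descend the branch where it is not.
static void SelectLOD(Cluster& c, const float viewer[3], float threshold,
                      std::vector<Cluster*>& drawList)
{
    if (c.children.empty() || ScreenSpaceError(c, viewer) < threshold)
    {
        drawList.push_back(&c);   // coarse enough (or a leaf): emit this patch
        return;
    }
    for (Cluster* child : c.children)
        SelectLOD(*child, viewer, threshold, drawList);
}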

DXR isn't stupid either, one may think. They need the abstraction so every GPU vendor can use its own custom BVH format suiting its hardware. I see that.
And the price to pay is simply this: on PC we build our BVHs on the GPU at runtime, during the game, each time we stream in some new stuff. This is stupid. But let them just pay a premium for big GPUs. All those cores need some work. The PC master race can afford to spend some cycles; they have plenty of them. No matter how much redundant work they do, they will still have one FPS more than a PS5, which really is all that matters to them.

So far so good. What's my problem then?
The problem is this: once your mesh exchanges a small patch of its geometry for a lower- or higher-detailed version, its topology changes.
This breaks the DXR BVH. It has to be rebuilt from scratch. You cannot refit it; you have to rebuild it completely - precisely because of the kind of gradual change that any progressive or continuous LOD solution requires in order to work.
Notice: as you move through the scene, all your models will change some sections of their mesh to fit detail to the screen. That's the idea of a proper LOD solution - any LOD solution, not just Epic's.
Result: your whole scene changes. You need to rebuild the BVH from scratch for your entire world; a complete rebuild is the only option DXR provides for this case.
Can your awesome RTX 3090 do this? No. Not even 10 of them in a mining rig could do it in time.
Can you shove your shiny RTX 3090 up your ass? Yes. And that's exactly what you should do with it. I would. I never requested Tensor Cores; no game dev did. We can do temporal upscaling without ML - UE5 is again a good example to prove it. And RTX I cannot even use, because the self-appointed experts at NV and MS headquarters were too busy getting high on their own farts while porting OptiX to DirectX, and they forgot about LOD out of incompetence or ignorance.

To allow us to solve this problem, we need one of two options:
1. GPU vendors expose their BVH data structures through vendor extensions, so we can build and modify them ourselves. We are not too stupid to do this, even after decades of taking the bait.
2. Make a BVH API that exposes it through abstractions all GPU vendors can agree on. That's more difficult, maybe.

Neither of these options will happen anytime soon. It will take years, more likely a decade.
So i ask you: Does DXR spur progress? Or does it dictate stagnation?

It's the latter. The most efficient way to do raytracing over my geometry will be a compute tracer: slow tracing, but the cost of BVH building is zero.
But I will not trace triangles, just surfels. So again, no sharp reflections or hard shadows. As my GI already provides environment maps to look up specular, I'm not sure if it's worth it at all.
It's crazy. But I think LOD is more important than raytracing, and they force me to choose between the two.

Some people will now ask: ‘Hey, but UE5 does use RTX! They even migrated their whole Lumen stuff to HW RT, so why do you say they can't use it with Nanite?’
The answer is that they make much the same compromise: they use a static low-poly version of their models for tracing, without LOD. So in general they cannot trace the awesome details they can rasterize. And they still need to build the BVHs for RT on the GPU for those low-poly models, although they could in principle just convert their own format. Brian Karis has also criticized DXR for these shortcomings.

Now I hope more companies will do their own version of Nanite, so people finally understand the broken state of HW RT and demand solutions.

JoeJ said:
Basically that's true for DXR as well. Maybe yours provides more abstraction by even doing the rebuild automatically, but it's exactly such abstractions that prevent me from using it.

The main difference between DXR and my implementation is that while you have an abstraction layer to use the ray tracer, you also have the actual ray tracer as an open source project next to it (which builds the DLL used by the application). The only downside is that the OpenCL sources required to run it have to be placed next to the executable (that said, I could just move those inside the DLL - but I don't want to, because external files are much easier to edit).

You mentioned writing a custom acceleration structure rebuild/refit? Doable without a problem - you simply add another Aggregate implementation and perform the logic inside, based on the flags of the Node whose acceleration structure you're currently processing.
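
To illustrate that (with hypothetical stand-in types again, since this is not the real Aggregate/Node interface), a custom implementation driven by per-Node flags might look something like this:

#include <vector>

// Hypothetical stand-ins for illustration only - not the real OpenTracer interface.
struct Node  { bool dynamic; bool topologyChanged; /* geometry range, BVH handle, ... */ };
struct Scene { std::vector<Node> nodes; };

// A custom Aggregate-style implementation that refits where it can and rebuilds
// where it must, driven purely by per-node flags.
struct RefittingAggregate
{
    Scene* scene = nullptr;

    void RefitBottomLevel(Node&) { /* update bounds, keep tree topology */ }
    void BuildBottomLevel(Node&) { /* full rebuild of this node's BVH   */ }
    void BuildTopLevel()         { /* small BVH over all node bounds    */ }

    void Refresh()
    {
        for (Node& n : scene->nodes)
        {
            if (!n.dynamic) continue;    // static nodes keep their prebuilt BVH
            if (n.topologyChanged)       // a geometry patch was swapped out (e.g. LOD change)
                BuildBottomLevel(n);
            else
                RefitBottomLevel(n);     // pure deformation: a cheap refit suffices
        }
        BuildTopLevel();                 // the top level is rebuilt every frame as before
    }
};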

Regarding graphics programming:

Many years back, the demoscene was still somewhat active - I recall some gems like Nature Still Sucks https://www.pouet.net/prod.php?which=9461 … or, for example, the game Outcast (I believe you can grab it for like 2 euros on GOG). These got me into 3D graphics (together with my favorite games: Morrowind (everyone remembers the water!), the first two Gothic games, and yes, the Half-Life games). I literally learned from NeHe, crawled Flipcode (while in my first years of 8-year college here, when my English was seriously bad), and I tried to make graphics demos with various effects - that's how I started spamming GameDev.net (I still miss IotD and the long discussions under it). It was the time when you had to abuse vertex and pixel shaders to achieve some sort of compute.

The moment GPU vendors introduced compute (with the GeForce 8000 series, if I remember correctly) I was like a little kid who discovered a sandbox - on some old machine I might even still have my Demos folder (it held hundreds of projects). I was honestly extremely disappointed with DXR/RTX the moment they proposed it. I saw no point in it at all.

From years of experience (not just in graphics programming) - any fixed function yields stagnation, any programmability yields progress and evolution.

DXR API:

Exposing acceleration structures as you hope is just not going to happen. First of all, it would take years or more. Second of all, it would require new hardware (with DXR 2.1 or whatever version to support it). DXR strongly reminds me of one thing from the past - the physics accelerator (Ageia PhysX) - which promised huge-scale physics simulation but ended up going bust. Instead, there are now numerous physics engines that use GPU compute for acceleration.

The downside is that, unlike Ageia, NVidia was actually capable of selling RTX to customers - with mostly minimal visual difference, as most of those games already used effect hacks that approximate things like GI, reflections, refractions, etc.

Note: My apologies to @joshklint - we got a bit sidetracked from the topic. I hope he doesn't mind too much. I originally had a post ready earlier in the afternoon, with math-heavy equations explaining how cone tracing tries to work (and how he could mitigate the wrong results a bit) - but I just didn't like it: too many equations and untested results. I'll honestly have to try one thing first to make sure, before I can post on that topic.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

Vilem Otte said:
The main difference between DXR and my implementation is that while you have an abstraction layer to use the ray tracer, you also have the actual ray tracer as an open source project next to it (which builds the DLL used by the application). The only downside is that the OpenCL sources required to run it have to be placed next to the executable (that said, I could just move those inside the DLL - but I don't want to, because external files are much easier to edit). You mentioned writing a custom acceleration structure rebuild/refit? Doable without a problem - you simply add another Aggregate implementation and perform the logic inside, based on the flags of the Node whose acceleration structure you're currently processing.

Sure. With a software raytracer I could do whatever I want, and supporting my LOD would just work.

But you should add DXR support as well. It will run 10 times faster than your shaders, making it clear that it's pretty pointless to develop software RT.
I mean, thanks to the chip crisis, it will likely still take 5 years until we can put RT into minimum specs. The crisis is even fortunate for those who have worked on this, like Crytek.
But had they known HW RT acceleration was coming, they surely wouldn't have spent years developing something that would only be useful for a few years.

It would have been nice of NV to inform us upfront about their plans, and maybe even show beta APIs to discuss them with devs.
There was a small circle of devs - DICE, Remedy, 4A Games. But none of them spotted the issue - they just adapted the feature to their traditional, LOD-less renderers. I've talked a lot about this with the 4A boss on the b3d forum (I'm just assuming his identity), but he was clearly in an RTX-defense position, constantly trying to minimize my critique and arguments. Either he really doesn't understand how LOD works, or he has a personal interest in keeping the RTX hype intact (which makes sense). BTW, one of my early proposals on the forum was to achieve photorealism by path tracing, but trace only one segment of the path and then look up the rest of the path's infinite segments - already integrated over the whole halfspace - from dynamic GI probes using another algorithm like mine. That's a pretty obvious solution, but maybe it's no coincidence the second Metro RTX version did precisely this a year later. :D
Though, AFAIK Epic was on board as well. They surely complained, but NV likely just did not want to expose their data structures. I'm pretty sure that's it, and that they hinder progress in full awareness of it, to keep control and flexibility on their side.
Andrew Lauritzen, now at Epic, was also there, and he finally gave me some backing after the UE5 presentation, which felt like a big consolation in my fight against the industry experts.
Looking back, the discussions were heated but on a high level. I had introduced myself as a hobby programmer, but they still took me seriously, which is quite nice. : )

It was the time when you had to abuse vertex and pixel shaders to achieve some sort of compute.

Haha, I never wanted to do this. I started serious GPU work only after compute was out. So I'm totally not the experienced gfx programmer some people think I am. I really only have good experience with compute, not with the gfx pipeline.

From years of experience (not just in graphics programming) - any fixed function yields stagnation, any programmability yields progress and evolution.

Yeah, man! And one would think they'd have learned this after all those years?
But no, they didn't. DXR would be OK in a high-level API like DX11. From a low-level API like DX12, I really would expect it to expose the most important data structure of raytracing.

Vilem Otte said:
Exposing acceleration structures as you hope is just not going to happen. First of all, it would take years or more. Second of all, it would require new hardware (with DXR 2.1 or whatever version to support it).

At this point, I would ditch MS and Khronos in favor of vendor APIs without looking back. VK still isn't on par with Mantle in terms of flow control on the GPU, but Mantle is far easier and less complicated overall.
From my perspective, even supporting four vendor APIs would be less work than working around a hopeless attempt at abstraction.

Intel has already confirmed their support for Traversal Shaders, so we might soon see Turing and Ampere lacking the latest API features, while AMD can still implement them on top of their flexible (but still black-boxed) compute solution.
Sadly, all people see is that NV is faster… ; )

JoeJ said:
We lost the ability to tackle open problems and became close-to-the-metal, low-level optimizers. To keep up with progress, all we did was read the latest NVidia papers, which taught us how to do brute force most efficiently.

Oh man, that is so true. XD

Vilem Otte said:
Note: My apologies to @joshklint - we got a bit sidetracked from the topic. I hope he doesn't mind too much.

Not at all, this is a really interesting discussion.

Diffuse lighting is now using just three samples (oriented like Orion's belt), with random rotation around the center. The appearance of diffuse GI is a lot softer and less blocky now.

This introduces a lot of noise, but I think it can be removed with a denoise filter.
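
For what it's worth, here is a minimal sketch of one way such a sample pattern could be generated: three cone directions in a row - one along the normal, two tilted to either side - with the row rotated by a per-pixel random angle around the normal. This is only my reading of the description above, written as plain C++ for illustration; it is not the actual Ultra Engine shader:

#include <cmath>

// Minimal vector type for illustration.
struct Vec3 { float x, y, z; };

static Vec3 Normalize(Vec3 v)
{
    float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return { v.x / len, v.y / len, v.z / len };
}

// Build three cone directions arranged in a row: one along the surface normal and two
// tilted to either side, with the whole row rotated by a random per-pixel angle around
// the normal. Assumes an orthonormal frame (tangent, bitangent, normal) at the surface.
static void BeltSampleDirections(Vec3 tangent, Vec3 bitangent, Vec3 normal,
                                 float randomAngle, float tiltRadians, Vec3 out[3])
{
    // Rotate the belt's axis within the tangent plane by the random angle.
    float c = std::cos(randomAngle), s = std::sin(randomAngle);
    Vec3 beltAxis = { tangent.x * c + bitangent.x * s,
                      tangent.y * c + bitangent.y * s,
                      tangent.z * c + bitangent.z * s };

    float ct = std::cos(tiltRadians), st = std::sin(tiltRadians);

    out[0] = normal;                                       // center sample along the normal
    out[1] = Normalize({ normal.x * ct + beltAxis.x * st,  // tilted to one side
                         normal.y * ct + beltAxis.y * st,
                         normal.z * ct + beltAxis.z * st });
    out[2] = Normalize({ normal.x * ct - beltAxis.x * st,  // tilted to the other side
                         normal.y * ct - beltAxis.y * st,
                         normal.z * ct - beltAxis.z * st });
}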

10x Faster Performance for VR: www.ultraengine.com

Adding ambient light into the voxel grid provides a much better result than adding it in the final light calculation.
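
A rough sketch of the distinction as I read it, with purely illustrative names rather than the engine's actual code. The presumable reason the first option looks better is that ambient baked into the voxels participates in the cone tracing like any other injected light (so it is affected by occlusion), instead of being a flat additive term at the end:

struct Color { float r, g, b; };

static Color Add(Color a, Color b) { return { a.r + b.r, a.g + b.g, a.b + b.b }; }

// Option A (described above as better): include ambient in the radiance written
// into each voxel during light injection, before mipmapping and cone tracing.
static Color VoxelRadiance(Color directLightAtVoxel, Color ambient)
{
    return Add(directLightAtVoxel, ambient);
}

// Option B: leave the voxels alone and add ambient only in the final light
// calculation, where it bypasses the GI stage entirely.
static Color FinalShading(Color coneTracedGI, Color directLightAtPixel, Color ambient)
{
    return Add(Add(coneTracedGI, directLightAtPixel), ambient);
}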

10x Faster Performance for VR: www.ultraengine.com

It was a good idea to focus on a single GI stage and just try to get that perfect, before wrangling with multiple stages. Here we can see the near-finished effect on a single stage, along with a mechanism to ease the transition to additional stages:

10x Faster Performance for VR: www.ultraengine.com

Josh Klint said:
It was a good idea to focus on a single GI stage and just try to get that perfect, before wrangling with multiple stages. Here we can see the near-finished effect on a single stage, along with a mechanism to ease the transition to additional stages:

Do you mean you calculate GI only in the closest mip for now, while at larger distances lighting falls back to a constant ambient / environment map?
That's how it looks. If GI were also on for the distant mips, it would mean the voxels already fail to capture the coarse architecture.

Btw, to check the accuracy or realism of the indirect lighting, you should use some more colorful textures so you can see color bleeding.
There's no bleeding visible if everything is just brown, so you simply can't see how well it works.

Personal opinion - it looks way too dark.

I've used a reference path tracer to compare.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

