How to multithread, conceptually

12 comments, last by frob 1 year, 4 months ago

Hi,

I've been doing some reading on multithreading and investigating if I'm going to do this in my game engine.
With this topic I'd like to check if my understanding of the concept is correct. Below is some high level pseudo code of my understanding how it can work.

Also curious to hear opinions and pros and cons. I know it's quite a broad topic, but imho it starts with understanding the basics.

Baseline, no multithreading:

https://pastebin.com/W7vUf8XQ

GPU multithreading (dividing draw calls, I know it's also possible to spread passes over threads):

https://pastebin.com/2XenDR5B

CPU multithreading:

https://pastebin.com/Neiw7cZW

Crealysm game & engine development: http://www.crealysm.com

Looking for a passionate, disciplined and structured producer? PM me


The 3rd link does not work, commenting on the second.

I lack experience with both MT rendering and DX, but one minor issue I see is the assumption that each thread does the same amount of work, i.e. that each work item takes the same time:

uint renderablesPerThread = myRenderBucket.size() / numThreads;

    for (uint thread = 0; thread < numThreads; ++thread) // was: numTreads (typo)
    {
        SetDeviceContext(thread);
        // each thread gets a fixed, equally sized slice of the bucket
        size_t begin = thread * renderablesPerThread;
        for (size_t rend = begin; rend < begin + renderablesPerThread; ++rend)

This may or may not be the case. Some threads may finish early and then sit idle while others are still busy and could use help.
I assume this problem is even larger nowadays, now that recent Intel CPUs mix efficiency and performance cores.

The alternative is something like a job system, where threads pull work items using an atomic increment.
The downside is that all threads may now fight for work concurrently, which has some synchronization cost. The less time a work item takes to process, the larger this problem becomes.
To minimize this cost, we can make larger packets, e.g. pulling X work items instead of just one, and processing them sequentially.
By tweaking the number X you can find the best compromise between saturating all threads and making sure they all finish at the same time.
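A minimal C++ sketch of that atomic-counter scheme (the function name, batch size, and work callback are illustrative, not from the post):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: worker threads claim work with one atomic
// fetch_add per batch. Larger batches mean less contention on the
// counter but coarser load balancing; tune batchSize for the compromise.
void parallelForBatched(std::size_t itemCount, std::size_t batchSize,
                        unsigned numThreads,
                        const std::function<void(std::size_t)>& processItem)
{
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t)
    {
        workers.emplace_back([&]
        {
            for (;;)
            {
                // Claim a whole batch with a single atomic increment.
                std::size_t begin = next.fetch_add(batchSize);
                if (begin >= itemCount)
                    return; // no work left for this thread
                std::size_t end = std::min(begin + batchSize, itemCount);
                for (std::size_t i = begin; i < end; ++i)
                    processItem(i); // process the batch sequentially
            }
        });
    }
    for (auto& w : workers)
        w.join();
}
```

With batchSize = 1 this degenerates to pure per-item pulling (maximum contention, best balancing); with batchSize = itemCount / numThreads it degenerates to the static split quoted above.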

As for OpenGL, you cannot drive a single context from multiple threads at once. When I did a multithreaded MFC / Direct3D 9 app a while ago, I dedicated one thread to rendering. I would be pleasantly surprised to find out that Direct3D now works differently.

taby said:
I would be pleasantly surprised to find out that Direct3D now works differently.

Both DX12 and 11 can do it, so that's a big reason why OpenGL is quite outdated in modern times.

But I don't know what the 'AZDO' (Approaching Zero Driver Overhead) techniques enabled back then. IIRC, this was (or is) the OpenGL way to address the draw call problem.

JoeJ said:
Both DX12 and 11 can do it, so that's a big reason why OpenGL is quite outdated in modern times.

It is mostly a mental issue, not a software one.

There is only one hardware bus, and ultimately the commands get serialized through it by the drivers. The GPU can handle multiprocessing internally, and CPUs these days are all multiprocessing, but there is still a serial, sequential bottleneck at the drivers and the hardware bus.

It can conceptually be easier to work with multiple data streams working in parallel, and drivers for Vulkan and modern DX can work with those multiple CPU threads and make them work internally with the drivers, but under the hood, there's still only a single hardware interface.

I do a lot of multithreading in my engine but I don't really use it for rendering (GPU related stuff), except in a few places where I need to destroy meshes and I don't want to slow down the rendering waiting to do it.

For CPU stuff, my main comment is that thread pools really seem to speed things up, especially if you are doing a relatively small amount of work in each thread. What I do is keep N threads waiting on a queue, with a dispatcher thread that sends work to the first available thread, or waits until one becomes free. I noticed a huge speedup going to this method from simply creating and destroying threads every time I needed to do something. The other thing I'll say is: try to do at least a modicum of work in each thread, and keep the data used by each thread cache friendly. I use separate heaps for each thread in the pool.
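A minimal sketch of that thread-pool idea: N workers are created once and block on a queue, so submitting work costs a lock and a notify instead of a full thread create/destroy. (Gnollrunner describes a dispatcher thread pushing to workers; here workers pull directly from a shared queue, a common variant of the same idea.)

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool
{
public:
    explicit ThreadPool(unsigned numThreads)
    {
        for (unsigned t = 0; t < numThreads; ++t)
            workers.emplace_back([this] { workerLoop(); });
    }

    ~ThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex);
            stopping = true;
        }
        wake.notify_all();
        for (auto& w : workers)
            w.join(); // workers drain the queue before exiting
    }

    void submit(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(mutex);
            jobs.push(std::move(job));
        }
        wake.notify_one(); // wake one idle worker
    }

private:
    void workerLoop()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex);
                wake.wait(lock, [this] { return stopping || !jobs.empty(); });
                if (stopping && jobs.empty())
                    return;
                job = std::move(jobs.front());
                jobs.pop();
            }
            job(); // run outside the lock so other workers can dequeue
        }
    }

    std::vector<std::thread> workers;
    std::queue<std::function<void()>> jobs;
    std::mutex mutex;
    std::condition_variable wake;
    bool stopping = false;
};
```

This omits things a production pool would want (futures for results, exception handling, per-thread heaps as Gnollrunner mentions), but it shows why reuse beats create/destroy: the OS-level thread setup happens once.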

Thanks all, this helps. I'll do some prototyping/testing with CPU threading.


frob said:
It can conceptually be easier to work with multiple data streams working in parallel, and drivers for Vulkan and modern DX can work with those multiple CPU threads and make them work internally with the drivers, but under the hood, there's still only a single hardware interface.

I guess it's a win if you really have complex rendering going on, so MT helps translate many API calls into data and commands the GPU can actually use. The bus bottleneck after that remains, of course.

But to me it feels more attractive to minimize the number of materials, pipelines, draw calls etc., and then do bindless and GPU-driven rendering. The CPU cost of rendering should then be so small that no MT is needed for gfx.
Though I'm not sure this works as well as I hope - I still need to learn about this first…

Gnollrunner said:
What I do is keep N number of threads waiting on a queue and I have a dispatcher thread which sends work to the first available thread or waits until there is one that becomes free. I noticed a huge speed up going to this method, from just simply creating and destroying threads every time I needed to do something.

When I started work on my current editor / offline tools, I tried to get rid of a cumbersome job system and instead use only new C++ features for MT.
But this was much too slow even for offline needs. It launched new threads constantly, and also ran way too many threads at the same time. : (

The burden with the job system is that I need to write callbacks for everything I want to do in parallel, plus eventually some struct to hold context. Very often I'm just too lazy for that, so lots of my stuff remains single threaded.
Having all those callbacks around also makes the code harder to maintain. It sucks and feels old school.
I know it's possible to do better. The physics engine I use can parallelize lambda functions without any callback. I guess it has some extra cost from using function objects, but I'm not sure. I should look at how this actually works…
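The lambda-based style can be sketched with plain std::async: the lambda captures its surrounding context directly, so no separate callback function or context struct is needed. (How the physics engine does this internally is an assumption; this only shows the style, and the helper name is made up.)

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical fork-join helper: split [0, count) into one chunk per
// thread, run the lambda over each chunk asynchronously, wait for all.
template <typename Func>
void parallelFor(std::size_t count, unsigned numThreads, Func&& body)
{
    std::vector<std::future<void>> futures;
    std::size_t chunk = (count + numThreads - 1) / numThreads; // ceil
    for (unsigned t = 0; t < numThreads; ++t)
    {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(begin + chunk, count);
        if (begin >= end)
            break;
        futures.push_back(std::async(std::launch::async,
            [begin, end, &body]
            {
                for (std::size_t i = begin; i < end; ++i)
                    body(i);
            }));
    }
    for (auto& f : futures)
        f.get(); // join: block until every chunk is done
}
```

Usage needs no callback plumbing; local state is simply captured:

```cpp
std::vector<int> v(64, 1);
parallelFor(v.size(), 4, [&](std::size_t i) { v[i] *= 2; });
```

The template instantiation avoids std::function overhead inside the inner loop, though std::async itself may still spawn threads per call; a real implementation would sit on top of a pool.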

I must say it is still a bit hard to do efficient parallelization, even for the CPU. Too much extra work and expertise is required for something that should be easily accessible and have native language support. I'm still hoping for future C++ improvements, but it's a long wait.

Just one additional thing I thought I'd throw out:

If you are working with a large-scale system that is not suited for multithreading (not talking about GPUs in this case), then one possible solution is to copy the non-threadsafe data structures on the main thread into a separate version that is used only by the worker thread. I'm not presenting this as an optimization, but strictly as a way to enable multithreading in a scenario where you would otherwise need to add synchronization all over the place.

The one example where I did this was the background compiler for my visual scripting. Think IntelliSense - when I change something in the code, I want to get highlighting for errors etc. automatically, as soon as possible. My engine internally is not really made for multithreading, and the compilation can obviously take way too long to execute on the same thread. So what I did was run the preprocessor step (which converts editor-suited data to a compilation-friendly format) on the main thread, then have the rest of the compiler run on that newly created data on a separate thread (which only it has access to). In the end, the result is processed serially again (which is cheap by itself).

Such a process obviously still has some overhead on the main thread. In my case it's acceptable: it's only a few ms compared to the multiple seconds the whole compilation can take (at least when running in debug). Still, you need an algorithm that is by itself so expensive that this additional serialized copy step pays off. In my case I got it kind of for free, since the preprocessor would exist in that form anyway; it's just that I cannot move that one step to the thread.
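The snapshot pattern described above can be sketched like this (the types and the toy "compiler" are illustrative, not Juliean's actual code):

```cpp
#include <string>
#include <thread>
#include <vector>

// Stand-ins for the non-threadsafe editor data and the compile output.
struct Document { std::vector<std::string> lines; };
struct CompileResult { std::vector<std::string> errors; };

// Toy compile pass; only ever touches its private snapshot.
CompileResult compile(const Document& doc)
{
    CompileResult result;
    for (const auto& line : doc.lines)
        if (line.empty())
            result.errors.push_back("empty line");
    return result;
}

CompileResult compileInBackground(const Document& live)
{
    // Serial step on the main thread: copy the non-threadsafe data.
    Document snapshot = live;

    CompileResult result;
    std::thread worker([&snapshot, &result]
    {
        result = compile(snapshot); // only the worker reads the copy
    });

    // ...the main thread may keep mutating `live` freely here,
    // since the worker never touches it...

    worker.join(); // afterwards the result is processed serially again
    return result;
}
```

The copy is the price paid to avoid synchronizing every access; as the post says, it only pays off when the threaded work dwarfs the copy.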

JoeJ said:
Too much extra work and expertise is required for something which should be easily accessible and have native support

It has been native in C++ starting in C++11, and improving in C++14, C++17, and C++20. There are more improvements coming.

Multiprocessing has always required additional thought and care, regardless of if you're using language-native functionality or a software library. Even the simplest approaches of parallel-for and parallel-task multiprocessing can have serious, hard-to-find bugs if there are unexpected dependencies in data or processing. More complex parallel algorithms like parallel search, parallel sort, and parallel vector transformation require significantly more effort to implement and get right.
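A tiny illustration of that point, using only C++11-native primitives: the same loop is a data race with a plain int (lost updates, undefined behaviour) and correct with std::atomic. The bug in the first version is exactly the kind of "unexpected dependency in data" that is hard to find, because it often still produces plausible numbers.

```cpp
#include <atomic>
#include <thread>

// BUGGY: two threads increment an unsynchronized int. This is a data
// race (undefined behaviour); increments can be lost. Shown only to
// illustrate the failure mode - do not rely on its result.
int racyCount(int perThread)
{
    int counter = 0;
    auto work = [&] { for (int i = 0; i < perThread; ++i) ++counter; };
    std::thread a(work), b(work);
    a.join(); b.join();
    return counter; // anywhere from perThread to 2*perThread (or worse)
}

// FIXED: std::atomic makes each increment indivisible.
int atomicCount(int perThread)
{
    std::atomic<int> counter{0};
    auto work = [&] { for (int i = 0; i < perThread; ++i) ++counter; };
    std::thread a(work), b(work);
    a.join(); b.join();
    return counter.load(); // always exactly 2*perThread
}
```

The facilities are native; the thought and care frob describes is in knowing that the first version is broken at all.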

In games there are usually too few people with the skills to implement these: people who know how to properly partition the problems and map them to a parallel solution. Typically those people will implement a basic task-based system and throw it to the masses. Then they'll spend the rest of the project implementing more proper parallel solutions as needed, and finding bugs introduced by other programmers who still think of the computer as a single-process device.

This topic is closed to new replies.
