Optimal multi-threading structure for efficient CPU usage


Hi,

I have reached a point in the development of my game application, where I would like to make use of multi-threading to improve its performance. By doing some research, I have come to understand that the best way to do so is to implement a Thread Pool/Job System and then split all the parallelizable work into tasks that worker threads can execute simultaneously.

So far so good.

However, I am struggling a bit to understand how to make sure that tasks of varying types do not end up blocking the worker threads, in particular I/O operations (disk reads/writes, network communication, GPU commands, etc.).

To demonstrate what I mean, I have conceived an example of a thread pool running on a quad-core system.

In this hypothetical scenario, all the parallel tasks are loaded into a single queue and then each one is picked out of the queue by one of the worker threads.
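To make this concrete, here is roughly what I have in mind, as a minimal C++ sketch (the names are just illustrative):

```cpp
// Minimal single-queue thread pool sketch: every worker pulls from one shared queue.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(unsigned count) {
        for (unsigned i = 0; i < count; ++i)
            workers.emplace_back([this] { run(); });
    }
    ~ThreadPool() {
        { std::lock_guard<std::mutex> lock(mtx); stopping = true; }
        cv.notify_all();
        for (auto& w : workers) w.join();
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(mtx); tasks.push(std::move(task)); }
        cv.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mtx);
                cv.wait(lock, [this] { return stopping || !tasks.empty(); });
                if (stopping && tasks.empty()) return;
                task = std::move(tasks.front());
                tasks.pop();
            }
            task(); // a blocking I/O call here stalls this worker -- the problem below
        }
    }
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex mtx;
    std::condition_variable cv;
    bool stopping = false;
};
```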

The main problem here is that I/O operations might end up completely blocking one of the threads, even for multiple frames, reducing the number of workers available to the other queued tasks (and thus decreasing performance).

What’s even worse is that when loading a resource from disk, the thread will most likely not be doing anything, since it’s waiting for the needed data to be loaded into RAM.

In this trivialized example, only worker thread #1 is being utilized close to its fullest potential. Worker threads #2 and #3 are spending half of their running time waiting for I/O, while thread #4 is spending all of its time waiting for the GPU to finish rendering (of course there is some processing being done by the driver but let’s keep it simple).

Not only that. If for some reason you request a large number of I/O operations, all the worker threads might end up getting blocked, effectively causing the game to freeze. Not good!

This made me rethink my approach. I considered implementing a priority system so that the important tasks would always get processed; however, this doesn't really solve the problem of the CPU being underutilized, and it also seems like unnecessary overhead.

My current solution is to have separate threads or even separate thread pools for different types of tasks.

In this example, the worker threads are used for the important game logic work (entity updates, visibility computation, GPU command list generation), while the “I/O thread” and “render thread” are completely separate (both of them also maintain their own queue).
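In code, the split might look roughly like this (reusing the ThreadPool sketch from above; the pool sizes are placeholder guesses, not measured values):

```cpp
// Separate pools/threads so blocking work cannot starve the game-logic workers.
ThreadPool workers(3);  // game logic tasks (quad-core example: 3 workers)
ThreadPool ioPool(1);   // blocking disk/network I/O lives here
std::thread renderThread([] { /* own loop: submit GPU work, wait, present */ });

workers.submit([] { /* update entities, compute visibility, ... */ });
ioPool.submit([] { /* blocking file read; only ever stalls the I/O pool */ });
```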

My concern with this solution is mainly about overhead. Since you have more threads than physical cores, the I/O and render threads will also end up taking some of the CPU time that could have gone to a worker thread. In particular, I am not certain how well this would work across different CPU architectures and operating systems with different schedulers.

Also, it seems like a good idea to have multiple I/O threads depending on the type of drive, as modern SSDs can apparently process multiple data requests at once.

Anyway… Am I on the right track? Have any of you ever encountered this problem, and how did you solve it? If so, could you perhaps point me towards a good learning resource (since none of the tutorials I looked at seem to mention this issue)?


Seems you are on the right track. A key takeaway that you seem to have figured out is that IO-bound tasks and CPU-bound tasks are different and should potentially be treated in different ways.

IO-bound tasks seem to work best in an asynchronous setting where the application tells the OS, or high-level primitives specific to the language, “wake me up when something is ready”. Having a thread sleeping on something doesn't take up many resources beyond the memory allocation of the thread itself, as far as I'm aware. This makes it possible to spawn many more threads than there are hardware threads without any issue. Different languages have slightly different models, but most allow many software threads per hardware thread. As an example of running many threads: for one Java application that was completely IO-bound, I spawned a few hundred threads, which raised CPU usage quite a bit, but there were still a lot of CPU resources left.
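To illustrate the point (in C++, since no language was specified; a thread per file is wasteful in general, but it shows that blocked threads cost little CPU):

```cpp
// Each thread spends almost all of its time blocked in the kernel waiting
// for the disk, so even a large number of them leaves the CPU mostly idle.
#include <fstream>
#include <iterator>
#include <string>
#include <thread>
#include <vector>

void loadAll(const std::vector<std::string>& paths) {
    std::vector<std::thread> threads;
    for (const auto& path : paths)
        threads.emplace_back([path] {
            std::ifstream file(path, std::ios::binary);
            std::string data((std::istreambuf_iterator<char>(file)),
                             std::istreambuf_iterator<char>());
            // ... hand `data` off to whoever requested it ...
        });
    for (auto& t : threads) t.join();
}
```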

CPU-bound tasks are a bit different, depending on how evenly you can spread out the work. It's either best to spin up a bunch of threads and let them finish on their own (if you can distribute the work evenly beforehand and there is a lot of it), or to have the threads take tasks from some sort of central place (a queue, as you mention), which might be seen as “work stealing” threads.

Depending on the amount of data, running things on different threads might actually slow down an application due to overhead, so it's probably good to do some measurements on this.

If you google how operating systems do it, you can probably find a lot of information on how it can generally be done. A framework I like in Rust is Tokio, which is an asynchronous framework that can also spin up CPU-heavy work on threads (thus being capable of solving both problems in a very user-friendly fashion). In Qt, if I remember correctly, for asynchronous tasks you spawn a thread and then report back results when the IO you're waiting for is ready. Not sure if you spawn the same kind of thread for CPU-heavy tasks in that framework. There are frameworks for these types of problems in most languages, so you can surely find something to reference wherever you look.

Edit: When I say that they are different, I mean they are at least different to the user. To you they may be treated similarly, but the user probably wants an IO-bound API for saying “call me back when ready” and a CPU-bound API for saying “spawn up this work on a thread and run it to completion”.

I use multiple thread pools of different thread counts for different purposes and other one-off threads for specific use-cases. In your example, I'd have:

  1. A “per-frame” thread pool that is filled with sub-frame tasks that always gets waited on by the end of the frame (e.g. go wide and update physics, go wide and record render commands). Always has a fixed number of threads less than the max # of hardware threads on the system.
  2. A generic long-running task thread pool for arbitrary long-running tasks. May allocate more threads for more tasks (up to some relatively high max).
  3. A dedicated I/O thread pool. This might only be a single thread (maybe 2) since you can only read data so fast, especially from a hard drive.
  4. A one-off thread for your swap/present synchronization and whatever else you need.
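As a rough sketch (assuming a ThreadPool type like the one earlier in this thread; the counts are placeholders, not recommendations):

```cpp
#include <algorithm>
#include <thread>
// ThreadPool as sketched earlier in the thread.

int main() {
    unsigned hw = std::max(2u, std::thread::hardware_concurrency());
    ThreadPool frameJobs(hw - 1); // 1. per-frame tasks, waited on each frame
    ThreadPool longTasks(2);      // 2. long-running background work
    ThreadPool ioPool(1);         // 3. dedicated blocking I/O
    std::thread present([] { /* 4. swap/present synchronization loop */ });
    present.join();
}
```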

Separate user APIs are also useful, as already mentioned, since different types of tasks are easier to work with through different API structures (e.g. poll+wait vs. callback).

Three issues:

  1. Most modern Intel CPUs have multiple “hardware threads” per core (hyper-threading). You will typically want to have one native thread per hardware thread, not just per core.
  2. Some APIs for I/O allow you to do OVERLAPPED work (Windows I/O completion ports, Linux libevent, etc.). If you can make all your I/O asynchronous/queued/overlapped, then you don't need a separate I/O thread. This may, however, be hard, so one I/O thread per blocking I/O device is probably reasonable.
  3. Other parts of the system may also need some CPU time to allow you to play the game. The graphics drivers often spawn their own set of threads, and you have no idea how many, or how much work they will do. Keyboard/mouse/discord/teamspeak/VPNs and all the rest may need a few cycles, or even a full core, and you won't know ahead of time.

So, the best approach is to spawn threads that let your game run optimally-ish, but then realize that the rest of the system will also need some throughput. So, spawning N-1 worker threads on an N-hardware-thread system is totally reasonable, making sure you don't block out all other processing just because you're rendering an extra-fancy particle system or whatever.

Then again, what will you use all those cores for on higher-end CPUs? Here's a Threadripper with 64 cores and 128 threads – now what? Your game probably has a minimum spec and some ability to scale upwards in quality, but “infinite scalability” is never actually a thing in reality, so some upper number of threads that you have actually tested with is totally reasonable. For example, you could spawn 1 I/O thread, 1 render thread, 1 network thread, and then min(2+(N/2), 8) game threads, where “N” is the number of hardware threads on the system.
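In code, that rule of thumb might look like this (the constants are just the ones from the example above, not universal truths):

```cpp
#include <algorithm>
#include <thread>

// min(2 + N/2, 8) game threads, where N is the hardware thread count.
unsigned gameThreadCount() {
    unsigned n = std::thread::hardware_concurrency(); // may report 0
    if (n == 0) n = 4;                                // conservative fallback
    return std::min(2 + n / 2, 8u);
}
```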

Exactly how to do all of this depends on your specifics, and what your minimum spec is, and how far you can actually use any extra threads that become available, over the minimum spec.

enum Bool { True, False, FileNotFound };

You didn't mention the language you're working in; this, however, may make a difference. In C#, for example, we already have the .NET implementation in System.Threading.Tasks, and many APIs like file I/O already provide an ‘async’ version which returns a task you can wait on for completion. The internals are a bit complicated and better explained in some blog posts, but the key principle is that .NET internally maintains a single thread pool on which task worker threads run. Tasks are nothing more than promise objects which maintain a state and a reference to an asynchronous ‘thing’. (Yes, ‘thing’ is the right term here, because it can be anything from a callback to a function to an HTTP request which may proceed asynchronously.)

C# also provides the async/await keywords. This is black compiler magic; the C# compiler creates a state machine for those functions, splitting them at every await keyword into separate states. This is where the true asynchronous magic happens behind the scenes. The state machine is processed on a worker thread until an await is hit. It returns the awaitable object instance, usually a Task, but it may also be extended to something else (since it is a compiler rather than a language feature). Then, once the awaited dependency is cleared, its state is set to completed, the awaitable is removed from the worker, and the dependent task is resumed. This may happen on the same thread, but also on another.

C++, on the other hand, can provide the same features (and I don't mean the standard coroutine/await keywords, which do the same thing in the background). I have worked on the same topic as you for a long time, reading through different articles and blogs. My first implementation was, like yours, a thread pool of worker threads which picked “tasks” from a queue. There is nothing wrong with that, except when you want to describe a scenario which involves some async process going on. But we have to differentiate between the different asynchronous/blocking processes going on in a game. Btw., thread scheduling is way more costly on Windows than on Linux, so you should avoid performing too much of it.

File I/O: This can be optimized in several ways and may not involve blocking at all. A naive approach is to have a directory of assets on disk that you load from time to time: 3D models, scene data, textures. The usual stuff. But this disk I/O can already be optimized like good CPU-cache-friendly code, because the disk also has a cache. A usual disk cache is 64k, so padding your file to 64k will occupy the entire disk cache, and everything else needs to wait. This may sound stupid, but it guarantees that a reading process finishes without having to wait for another process to complete because your data is fragmented in the cache. As an example: if you have a file of 55k and one of 32k, one read needs to wait for the other to complete in order to free up the occupied cache space to load the remaining portion of the file into.

Using buffered streams is also a good alternative to plain file read/write operations. The stream caches a portion of the data you read/write and synchronizes it to/from disk when necessary. This also speeds things up a little.
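For example, a buffered read in C++ might be sketched like this (the 64k chunk size is just the figure from above):

```cpp
// Read a file through a large reusable buffer instead of many tiny reads.
#include <cstdio>
#include <vector>

std::vector<char> readWholeFile(const char* path) {
    std::vector<char> data;
    if (FILE* f = std::fopen(path, "rb")) {
        std::vector<char> buffer(64 * 1024); // 64k chunks, as discussed above
        size_t n;
        while ((n = std::fread(buffer.data(), 1, buffer.size(), f)) > 0)
            data.insert(data.end(), buffer.data(), buffer.data() + n);
        std::fclose(f);
    }
    return data;
}
```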

A more advanced file I/O optimization is memory-mapped files. This widely available OS feature lets the OS manage all your I/O in virtual memory pages. Again, padding (e.g. to 4k, the usual OS page size) will improve performance. The OS reads your file from disk as usual but copies the contents over into a virtual memory page. This page is automatically freed when no longer in use, and you can have a single file that is larger than your installed RAM (on-demand swapping). Another benefit is that more than one thread can process the file at once, because everything is already in memory and you obtain a pointer to it rather than a file handle.
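A minimal POSIX-only sketch (Windows would use CreateFileMapping/MapViewOfFile instead; error handling kept short):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a whole file read-only; the returned pointer can be read by many
// threads at once. Unmap later with munmap(ptr, size).
const char* mapFile(const char* path, size_t* sizeOut) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping stays valid after closing the descriptor
    if (p == MAP_FAILED) return nullptr;
    *sizeOut = (size_t)st.st_size;
    return (const char*)p;
}
```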

Network I/O: This should be non-blocking anyway, and there are mechanisms built into the OS which help achieve that. It depends on your platform, because Windows provides I/O Completion Ports (IOCP), while Unix, and so most consoles out there (Nintendo Switch / Sony PlayStation, which seem to be built on Unix-like kernels), have the poll command to query a socket's state. However, one can build up IOCP-like behavior more easily on Linux than vice versa, so I decided to go for IOCP. IOCP is, simply speaking, an event object bound to a specific occurrence (for example, in network communication, that there is data to receive), which tells the caller that the next receive call on a socket is guaranteed not to block. Those event objects are queued by the OS, so you can have more than one. For a client application, I always take 1 thread out of the thread pool and make it await an IOCP event; on network-heavy games it may become 2 threads. These threads sleep for as long as there isn't any network action going on, and when they wake I do nothing except push the packet into the message queue, which is processed by my usual worker threads.
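For the poll side, a minimal sketch (POSIX sockets; an IOCP example needs more setup than fits here):

```cpp
#include <poll.h>

// After poll() reports POLLIN, the next recv() on this socket won't block.
bool waitReadable(int socketFd, int timeoutMs) {
    pollfd pfd{};
    pfd.fd = socketFd;
    pfd.events = POLLIN;
    int r = poll(&pfd, 1, timeoutMs); // 0 = timeout, <0 = error
    return r > 0 && (pfd.revents & POLLIN);
}
```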

However, since all of this won't help with one thing, namely when tasks depend on other tasks to complete, I needed to do some further research, and I found this GDC Talk about how to parallelize your game with fibers: micro-tasking without the OS thread scheduler. Fibers (still exposed by the Windows API, although their use is discouraged these days because they have caused trouble in the past) are a way to break execution of a function in order to allow something else to be done. They work similarly to .NET tasks but don't involve a state machine, nor do they require compiler magic.

Our implementation is quite straightforward. We used some assembly instructions to create a new callstack on a virtual memory page. These stacks can be of any size; I tested 1kb, and it seems to work for now. Then another assembly function comes in which performs a switch on the current thread: the current callstack is swapped with the one from the memory page, which leads to an entire switch from one point of execution to another.
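For illustration only (this is not our assembly; the same stack-switching idea can be shown with the old POSIX ucontext API, which is deprecated but still available on Linux):

```cpp
#include <ucontext.h>
#include <cstdio>

static ucontext_t workerCtx, taskCtx;

static void taskBody() {
    std::puts("task: doing some work, then yielding");
    swapcontext(&taskCtx, &workerCtx); // "await": hand the thread back
    std::puts("task: resumed, finishing");
}

int main() {
    static char stack[64 * 1024];   // the task's own callstack
    getcontext(&taskCtx);
    taskCtx.uc_stack.ss_sp = stack;
    taskCtx.uc_stack.ss_size = sizeof(stack);
    taskCtx.uc_link = &workerCtx;   // return here when taskBody ends
    makecontext(&taskCtx, taskBody, 0);

    swapcontext(&workerCtx, &taskCtx); // worker -> task
    std::puts("worker: task yielded, resuming it");
    swapcontext(&workerCtx, &taskCtx); // resume the task where it yielded
    std::puts("worker: task completed");
}
```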

It works quite well on Windows (x86/x64) and Linux (x86/x64/ARM), and we built our task scheduler on top of those two functions. It operates on our existing thread pool, occupying one thread for each CPU core minus one, by spawning a worker on each of those threads. The main thread joins the scheduler as the remaining thread, so we can initialize our game engine properly and then operate on tasks only. The scheduler itself assigns tasks to the workers' queues in a round-robin fashion. If a worker runs out of tasks, it begins to steal tasks from other workers, or it is put to sleep for as long as there isn't any more work to do. Dispatching a task resumes a sleeping worker; this helps us control CPU usage along with the amount of work to do, and saves processing resources.

This sounds quite similar to what I described before, but the real magic happens when we hit our own Await functions. The task maintains a state, which is an atomic integer indicating the amount of work that has to happen before the task can resume. When a worker starts processing a task, its callstack is switched via a small bootstrap function. This function only has to guarantee that the thread never runs past a task's boundaries (the task needs to pass the thread back to the worker by switching the callstack again). So we switch from the worker to the task's callstack and start executing the task, which may call Await at any time while executing. An Await sets the task's dependency counter to the amount of work to be awaited and switches the callstack back to the worker. The worker then checks the task for completion and otherwise puts it into a waiting queue. This queue is checked for tasks ready to resume before a new task is picked from a worker's queue.
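The dependency-counter part might be sketched like this (illustrative names, not our actual code):

```cpp
#include <atomic>

// A task records how many children it is awaiting; the last child to finish
// marks it ready so a worker can switch back onto its callstack.
struct Task {
    std::atomic<int> pending{0}; // work items this task is still awaiting
    // ... callstack handle, entry point, scheduler bookkeeping ...
};

void await(Task& t, int childCount) {
    t.pending.store(childCount, std::memory_order_release);
    // switch the callstack back to the worker here; the worker parks `t`
    // in the waiting queue until `pending` reaches zero
}

void onChildFinished(Task& parent) {
    if (parent.pending.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        // counter hit zero: the parent task is ready to resume
    }
}
```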

The system works like a charm, and I was happy to observe a significant decrease in idle threads in our profiler.

