Back to General and Gameplay Programming

What kind of coding standard do you prefer

Ivica Kolic · 2022-12-20T20:57:39

So, other than the official answer “what ever standard client prescribes”, what kind of coding standard do you use at home in your personal projects?Here is mine:Standard is optimized for speed, simplicity and readability and fits on one page. To get the best cache performance it is wise to keep your memory sequential and compact as possible (DOP instead of OOP at least at the lower levels). ECS is main paradigm.using vectors instead of object pools. I want sequential and pre-allocated memory and even if memory isn't preallocated (empty vector) it will become so after a few frames (when enough objects get created). Maybe add another vector to track free elements (or wrap those two ones in a template called ConservativeVector - but that is practically object pool if you preallocate enough memory). Be careful with references. If you have a lot of dynamic object this might not be the best solution.using indexes instead of pointers (this goes together with a point above). In most cases unsigned short will be enough and it will be 4 times smaller than pointer so your data will fit in cache better. This also means that you absolutely don't have to write copy constructors and have any kind of memory management (or very little) - literally entire engines and editors can be written without a single new, delete, malloc and free command (at least for static data like meshes and textures). Another benefit is the code can be run on GPU and CPU all the same (if you wrap recursion). I've implemented ray-tracing octrees this way and did all the debugging on the CPU. My BVH hierarchy copies to GPU the same way. If you have some extremely complex hierarchy, debugging with indices is much easier than reading pointers (easier to draw entire hierarchy on paper and setting break points).using Memory Context class to store all those vectors that would otherwise be created as a local variables in the functions. Of course, these memoryContext objects should be reused (and passed by reference) so memory allocations won't happen on every use but only on first few usages. If you have a thread pool, have one of these objects created for every thread. Be careful with recursive functions.don't use sets or lists - they are 10-20 times slower than vectors. Migrate all your helper libraries to use vectors for internal storage (RadixTries, PatriciaTries, Dictionaries, Octrees, Radix Sort functions, geometry processing functions…)SortForEfficiency is a common function in my helper libraries.Always pass by reference. Don't return vectors, but pass them as input-output parameter. That way memory pre-allocation will be maintained.use JSON for storage and clipboard copy paste. It is simple and it will maintain compatibility within different versions of application and anyone can code plugins in any language they want. Nobody has to waste time with writing custom DSL parser every time. If you are very lazy, you can even just dump those vectors in a file and hope that format won't change but I wouldn't rely on that.greedy thread pool. If it isn't faster than OpenMP you haven't created a fast thread pool. In all my use cases I only needed a thread pool where all the threads run the same task so I've actually omitted single threaded type of task in the latest implementation. It is kind of like async but more. There are 3 levels of task priority and only first one is non blocking. If bottleneck happens all the free threads can help the ones that are still working by preforming sub tasks (there are macros that I use in all of my code that will do parallel processing if thread pool is present or just single core if not). One thing, if some thread submits a task to queue and manages to finish it before anyone helped - that is OK - there won't be any penalty like with OpenMp.I rarely use inheritance and virtual functions but good example of it is the GUI system (it is highly efficient and packs one integer per graphical element instead of entire geometry of 4 vertices which is 28 times bigger like with some other libraries). So, in some cases OOD is more elegant than DOD but I mostly prefer DOD. OOP example is using virtual functions for OnCreate and OnUpdate on your engine's objects. On the other hand, if you use DOD and composition, these function can just be properties of the object. If they are set then call them, otherwise do the default functionality.not using single file per class. I like to keep things that make sense together grouped. I also prefer header only libraries for smaller things that won't change much in the project (many times these classes are templated so there isn't much choice). Bigger classes that are updated frequently should have both H and CPP files.That is it. Almost no memory management and very fast code (for certain types of project where most of the objects is mostly static).Edit: edited for clarity.

General and Gameplay Programming Programming

Started by Ivica Kolic December 16, 2022 11:35 AM

82 comments, last by JoeJ 1 year, 4 months ago

Juliean

7,347

December 18, 2022 11:04 AM

JoeJ said:
Juliean said:
This optimization only makes sense when accessing a static variable multiple times within the same function, but it does provide a noticable speedup.
I would never had figured out this on my own, thanks for sharing! :D
Will add and and compare later, with a somewhat larger grid…

You're welcome! Looking forward to the results.

I'm now also implementing the binning/type-filter I mentioned before. So I hope you are happy to hear that you got me motivated to do at least that :P I'll also post the results once I'm done, which might depend - I'll probably start to hardcode stuff for now, if I have to make it super configurable it might be a bit tricky on the tool-side for a start, especially if I don't know how it'll turn out.

JoeJ

4,200

December 18, 2022 11:31 AM

edit: more buggy results removed

Ivica Kolic

104

Author

December 18, 2022 11:39 AM

@Gnollrunner unsinged short is good enough for indexing meshes and textures and sounds. It was not good enough for objects - there I've used integer.

Pointer vs compacted local data: In this discussion there are a lot of interesting examples that demonstrate why sequential access is so much faster than random access in many cases even when using acceleration structures. It is an interesting read, but my original point is that I do it for simplicity and that in most cases it turns out to even give speed boost. Literally looping through 10000 ordered entries to find specific one will be faster than doing binary search or using other acceleration structures. So it is simple and fast but definitely not for every situation. I think that the only conclusion is that testing different options is the only right way to go. I did mention that I use SortForEfficienty a lot for all kinds of containers.

Maybe even if you have objects all over the place (in your heap) you could use some local precaching into small vector and see how much faster solution will be. I did this when I was simulating entire solar systems with a lot of details on planets - obviously I can see less than 1% of entire word so accessing everything in on big heap all the time was a big no. I had that heap and I had localized data that would serialize every now and then and once some object left my local space.

Interesting sample how sequential access is much faster:

I once trained some neural network with 100 GB of data. You would think that if you have 128 GB or RAM you won't have to do much to make things fast but that is wrong because if you shuffle data (which you must) it will cause such a slow down (every memory access will be cache miss). At the end I've implemented memory efficient shuffling that always goes sequentially through segments of the original data and randomly dumps entries into some file (randomly picks segments). I do this several times but the good news is that now everything works even on a 8 GB RAM, works faster but requires 300GB of SSD space. So, using SSD as a cache in a sequential way ended up being much faster than original solution that had 128 GB of RAM and was doing cache misses.

Juliean

7,347

December 18, 2022 12:24 PM

Ivica Kolic said:
@Gnollrunner unsinged short is good enough for indexing meshes and textures and sounds. It was not good enough for objects - there I've used integer.

One comment to that: unsigned short specifically does have an overhead of its own, on standard pc-hardware. The instructions related to it, and how registers need to be handled are not optimal. Shouldn't really matter for most cases, but its something to keep in mind. Specifically, unless you do some packing of multiple numbers in one struct as I described earlier, you should not use “short" as an optimization-measure. Use uint8_t if possible (that one is actually fast), or uint32_t. This one you can check yourself, by seeing what uint16_fast_t etc… are typedefed to be. On PC. its generally “unsigned int” and not “unsigned short” - while “uint8_fast_t” is "unsigned char".

Ivica Kolic said:
Pointer vs compacted local data: In this discussion there are a lot of interesting examples that demonstrate why sequential access is so much faster than random access in many cases even when using acceleration structures. It is an interesting read, but my original point is that I do it for simplicity and that in most cases it turns out to even give speed boost. Literally looping through 10000 ordered entries to find specific one will be faster than doing binary search or using other acceleration structures. So it is simple and fast but definitely not for every situation. I think that the only conclusion is that testing different options is the only right way to go. I did mention that I use SortForEfficienty a lot for all kinds of containers.

Is it simpler for you? Outside of optimization, I do have on example, where I used it for objects by storing a “handle” to an entity instead of the pointer, because the pointer could become invalid, but that does make things actually a lot more complicated in most places, at least for me. Since you now eigther have to store the container where the things reside alongside the int (if you want to make it usable like a pointer - which is what I ended up doing), or have to store the container separately and access/pass it all over the place. But maybe for problems that are more local, it could be not as bad.

Juliean

7,347

December 18, 2022 12:36 PM

JoeJ said:
Haha, 10 times faster. That was a good tip. : ) Now with optimization on in release:

Thats already a good thing ? I'm feeling vindicated right now about my assessment about debug-performance being important (in an earlier thread), since newer MSVC-versions now make “std::move” or “std::forward” be no-ops even in debug-mode, even with the ability to mark custom cast-functions [msvc::intrinsic] to achieve the same effect. So even if the speedup in release and on different compilers is not that big, its still a win in my book for a relatively easy improvement (which doesn't require any structural adjustments of the underlying algorithm).

Ivica Kolic

104

Author

December 18, 2022 01:07 PM

Juliean said:
Is it simpler for you? Outside of optimization, I do have on example, where I used it for objects by storing a “handle” to an entity instead of the pointer, because the pointer could become invalid, but that does make things actually a lot more complicated in most places, at least for me. Since you now eigther have to store the container where the things reside alongside the int (if you want to make it usable like a pointer - which is what I ended up doing), or have to store the container separately and access/pass it all over the place. But maybe for problems that are more local, it could be not as bad.

Samples:

Mesh boolean operation in my CSG library: unsigned short was more than enough for everything. All mem can be preallocated since you know that resulting mesh won't have more that total number of faces of mesh A and B (in most casses less). Debugging is much easier when you have IDX of a face/edge/vertex than pointer to it (who could even read that).
Other sample is the editor where objects would change but only when user is doing something. Most of the time 99.99% rendering thread would loop through it. When I'm changing something, rendering thread is stopped. So this is very simple example with practically static data.

Anti-sample (but also sample because of precaching):

Engine that handles entire solar systems. Since there are millions of objects and no way they can fit in CPU cache, I precache local objects into those small preallocated vectors. In most cases les than 0.001% of objects is accessible to player - if these objects are locally precached then practically everything fits in a local cache and CPU doesn't have to jump all over the place to access correct memory. Network engine did however had to jump all over the place, but very rarely because it was p2p network engine and you might have end up bein server for interior of some space station but you are also client on some spaceship that left that space station.

Juliean

7,347

December 18, 2022 01:15 PM

Ivica Kolic said:
Mesh boolean operation in my CSG library: unsigned short was more than enough for everything. All mem can be preallocated since you know that resulting mesh won't have more that total number of faces of mesh A and B (in most casses less). Debugging is much easier when you have IDX of a face/edge/vertex than pointer to it (who could even read that).

So for indexing into “primitive” data inside the mesh, if I understand correctly? This should also fall under the catory of “local” problem that I described - you are probably using those indices inside a class/system where you always have access to all the data needed to actually resolve the index back to a pointer/data if need to.

As for readability/debuggability: If we are talking about primitive data, sure. A pointer to an int is way less indicative than an index saying which int it is. But if you have a pointer to an object (with multiple fields, potential inheritance etc…) than the index would become way less understandable IMHO. So its object 425… nice? Whats its properties? What type is it? etc… and if you still want to know which index is it, you can always store the index (even only in debug-builds conditionally, if so desired).

Ivica Kolic said:
Other sample is the editor where objects would change but only when user is doing something. Most of the time 99.99% rendering thread would loop through it. When I'm changing something, rendering thread is stopped. So this is very simple example with practically static data.

That is also one of the benefits I mentioned earlier about handling the backend of my renderer with indicexed integers. The new code for handling device-removal can simply change the stored pointers internally, without forcing users to update what they have. Click noise.
But on the other hand, the actual API/Plugin/Script-Code still gets a “Texture” object-pointer (which is decoupled from the backend via the integer. If find it way easier to code high-level logic that way. Want size of texture? texture.GetSize(). Obviously, getTextureSize(textureID) could be used, but this would require the storage for textures to be stored globally behind the scenes; and you'd have to make different “ID”/handle types so that you cannot just call getTextureSize(vertexBufferID) by accident (and get to keep the other benefits of strongly typed fields, when we are talking about ie. binding them to a scripting-API by using templates, and not having to specify the type yourself).

JoeJ

4,200

December 18, 2022 01:38 PM

edit: even more bugs.

JoeJ

4,200

December 18, 2022 01:45 PM

Btw, i did the same thing on GPU, maybe 7-8 years ago, on some GCN.
GPU really is super fast for brute force N-Body problem like this.
But the threshold was 64 bodies.
After that, building such grid became faster.

I guess nothing has changed since that for modern GPUs, but not sure.

Juliean

7,347

December 18, 2022 01:58 PM

JoeJ said:
However, only now i notice brute force became faster than any grids with Clang. With a whopping 5000 objects. I will need days to swallow this down and deal with it… :O

Wuut? Are you sure that you didn't introduce some suble error in the algorithm for eigther case? I mean, it does find the same number of elements, but even I would not have expected this to happen for that many objects.

Alternatively, maybe its because with that many elements, the cells all become “satured”, and you end up doing O(N^2) for each for a number of elements that is too high comparatively? I'm not really sure how the cells are divided exactly, but root of 5000 is 70, is of we assume 8*8 cells, that would mean roughly 8-9 elements on each axis per cell, so 81, which would mean that even per cell you are doing about as many brute-force steps per cell than my game is doing in high-load scenarios ? Assuming that intra-cell comparisons are O(N^2), which I assume. So following that logic, brute force would be 12.500.000 “steps”, while 64-grid cells * (81*81) collisions would mean 420.000 by my napkin math, which is a difference of x25 “steps” for brute-force, which would still fall within the range that one could except the raw speed to outweight the complexity-difference.

On my own front, I did implement the collision-filtering, with somehow good results:

Running my test-sequence with the bossfight with the most projectiles, in debug execution-speed goes up from 35x to 72x. Nice! I don't have a release-variant for this single test, but for the whole project-tests, it went from 87s to 83s, which is a good improvement considering the length of game that is being run in that time. Sorry for the unusual performance-metrics, I could measure the actual time being spent inside the actual loop, but I do care about how it globally affects those test-speeds the most, this has a practical advantage for me ? Considering that those numbers do include everything that is being run with the game, those numbers do mean a lot actually.
EDIT: See next post for some corrections

The code is here: https://pastebin.com/gJRJ3M0V

There is a lot of engine-specific yadayda going on, but I hope the core of the algorithm is understandable. Notable things are:

Iterating over the actual containers I have still done in O(N^2), which made ma chuckle a bit. But I am fairly certain that for an N of 5 as of now, it is still better, and easier to type than creating explicit list-of target types for this quick example ?
There is some additional complexity in having to order collisions, which already existed before, which not every system might need. So this could be used to speed up things even more
There a few things which potentially could be moved around to eigther improve, or make worse, the characteristics. Like certain properties could additionally be cached when moving the elements into their bins, or less, depending on the use-case. This does depend a bit on how often things actually collide, which brings me to my next point:
This implementation is only for “touch”-triggers, meaning no physically blocking collision. Which also means that the actual % of overlaps here is very few each frame. The actual collision-system, at least for this new format, could also be updated somewhat easily, as I have certain guarantees (like than an entity will not physically be deleted until the end of frame). However, the system in its entirely is a bit more complex, so going beyond that with a grid there wouldn't be as easy. In short, the blocking-collisions are not handled just in one loop. For the specific movement-system I have, it is necessary to dynamically update potential locations of objects while they move, as well as potential collision-pairs, in a sense that cannot simply be resolved by pushing collisions out of their contact points.

I'll continue implementing this for blocking-collisions as well, and see what the final numbers say ?

What kind of coding standard do you prefer

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

What kind of coding standard do you prefer

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines