
I am starting some research on optimizations for my game engine; yes, I have reached that point. I have an Alienware m18 R2, so this will be fun. First, before I go on: I am getting the machine code as small as possible for cache footprint and locality. At the same time I am restructuring functions/methods to conserve resources and direct the load; it's nifty seeing high-level designs surprise you with their effects, and I'm wondering if this is a novel feature.

I am looking into parsing a syntax tree with GLSL compute shaders on my GPU, so GPGPU abuse. Rather than having excessive pipelines or special conditions, I want a single global pipeline and one giant parse tree created and updated by the video card, so work is performed directly on the data stored in kernel space. That data lives in my dynamic/static memory systems, which let me do linking on a custom heap, and the STL is replaced by high-performance house code; I have house code for everything except Win32, and POSIX for Linux.
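To make that concrete: GPU-side tree work more or less forces a flat, index-based node layout, because raw pointers do not survive the trip into a compute buffer or a relocation inside a custom heap. A minimal sketch of such a layout (hypothetical field names, illustrative only, not the engine's actual structures):

```cpp
// Flat parse-tree layout: nodes live in one contiguous array and refer to
// each other by index, so the whole buffer can be uploaded to a compute
// shader or moved inside a custom heap without pointer fix-ups.
#include <cstdint>
#include <vector>

constexpr uint32_t kNoNode = UINT32_MAX;

struct AstNode {
    uint32_t kind;         // token / production id
    uint32_t first_child;  // index into the node array, kNoNode if absent
    uint32_t next_sibling; // index into the node array, kNoNode if absent
    uint32_t token_start;  // offset into the source text
};
static_assert(sizeof(AstNode) == 16, "keep nodes tightly packed for the GPU");

// Append-only: existing indices stay valid as the tree grows, which is what
// lets another pass (or the GPU) walk the tree while it is being built.
uint32_t add_node(std::vector<AstNode>& pool, uint32_t kind, uint32_t token_start) {
    pool.push_back({kind, kNoNode, kNoNode, token_start});
    return static_cast<uint32_t>(pool.size() - 1);
}

int main() {
    std::vector<AstNode> pool;
    uint32_t root = add_node(pool, 0, 0);
    uint32_t child = add_node(pool, 1, 5);
    pool[root].first_child = child;  // link by index, never by pointer
    return 0;
}
```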

I am looking into kernel bypass, if it applies to my project's problem, which I wish I could share. I will keep the same ideas and goals I had at the start. So far I have only scratched the surface, for example async buffered I/O with pipes and filters, and I am hoping to finish my single global pipeline (hint: it is made special), with some extra threads in the thread pool for computation or data transfer by the CPU when needed.
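For the pipes-and-filters part, here is a generic sketch of one stage boundary: a blocking queue connecting two filters, with a plain thread standing in for the thread pool. This illustrates the pattern only; it is not the special global pipeline described above:

```cpp
// Two filter stages connected by a blocking pipe; close() lets the
// downstream stage drain and exit cleanly.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

template <typename T>
class Pipe {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    std::optional<T> pop() {  // blocks until an item arrives or the pipe closes
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
};

int main() {
    Pipe<int> pipe;
    std::thread consumer([&] {  // filter 2: consume
        while (auto v = pipe.pop()) std::printf("stage 2 got %d\n", *v);
    });
    for (int i = 0; i < 4; ++i) pipe.push(i * i);  // filter 1: produce
    pipe.close();
    consumer.join();
    return 0;
}
```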

I hope I have shared enough to pick up some novel hints. I will write up a document and post it later in this thread. Maybe I will write two books: one on novel game engine architecture for dynamic real-time games and projects, and a second on code severely optimized by design, covering code efficiency, cache footprint, and the measurable results of locality.

Lastly, I have almost finished a C compiler and am studying advanced parsing (https://www.amazon.com/Parsing-Techniques-Practical-Monographs-Computer/dp/1441919015). Once it is complete I will start on my own C++ compiler, which will only carry the build to assembler output by leveraging the libraries, the STL, the entry text/code, and the rest of the code that still needs completing. That means I will also write my own linker. I know that is rare, but it would give the engine massive benefits, like frame management, instead of leaving those to the loader of the executable; responsibilities get passed on to the engine so it can describe every part, every facet, of itself.

Does anyone have suggestions, topics to look up (I will research them if I have to), comments on the design, or, more specifically, architecture advice? But do not be fooled: my architecture and general design are very advanced and complicated, so any help will do.

Learn to use a profiler. Microsoft PIX if you are on Windows.

Actual measurement is the only way to know.

Far too many beginners jump into changes thinking they are improving performance when either the specific code wasn't slow to begin with, or they accidentally made it even worse because they didn't measure both before and after.
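A minimal before/after harness, using nothing beyond the standard library. A real profiler such as PIX will tell you far more, but even this much catches an "optimization" that actually made things slower:

```cpp
// Time the exact code you are changing; run it before and after the change.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> data(1 << 20, 1);

    auto t0 = std::chrono::steady_clock::now();
    long long sum = std::accumulate(data.begin(), data.end(), 0LL);  // code under test
    auto t1 = std::chrono::steady_clock::now();

    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("sum=%lld in %lld us\n", sum, static_cast<long long>(us));
    return 0;
}
```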

If you intend to go deep, you will want to learn all about the out-of-order core, the phases of the modern CPU pipeline, and all the caches and buffers internal to the processor. Performance isn't just due to the code being run; much of the surrounding code, and even code from other processes, will affect your code's performance.
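A quick way to see the cache hierarchy in action: sum the same matrix twice, once along rows and once along columns. The column walk touches a new cache line on nearly every access and is typically several times slower, even though it executes the same instructions:

```cpp
// Row-major vs column-major traversal of the same data.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    constexpr int N = 4096;
    std::vector<int> m(static_cast<size_t>(N) * N, 1);

    auto time_sum = [&](bool row_major) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                sum += row_major ? m[static_cast<size_t>(i) * N + j]   // sequential
                                 : m[static_cast<size_t>(j) * N + i];  // strided
        auto t1 = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s-major: sum=%lld, %lld ms\n", row_major ? "row" : "column",
                    sum, static_cast<long long>(ms));
    };
    time_sum(true);
    time_sum(false);
    return 0;
}
```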


@scruthut Damn dude, you’re really going all-in on this - props for the dedication! Sounds like you’re basically building a custom low-level beast from scratch, optimizing down to the bare metal with cache-efficient design, custom heaps, and even your own linker/compiler. That’s some next-level stuff.

For parsing a syntax tree on the GPU with compute shaders, yeah, GPGPU abuse can be crazy efficient if you manage memory transfers properly. Might be worth looking into wavefront processing and SIMD optimizations to really squeeze out performance. Since you're already deep into custom memory management, persistent data structures could be an interesting area to explore for keeping that parse tree updated without excessive reallocations.
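One way to read "persistent data structures" here: an arena that never moves old nodes, so an updated parse tree can share subtrees with the previous version instead of reallocating them. A rough sketch of the allocation side (hypothetical, not tied to any real parser):

```cpp
// Chunked arena: memory is handed out in fixed-size chunks that are never
// freed or relocated, so pointers to old nodes stay valid across tree edits.
#include <cstddef>
#include <memory>
#include <vector>

template <typename T, std::size_t ChunkSize = 1024>
class StableArena {
    std::vector<std::unique_ptr<T[]>> chunks_;
    std::size_t used_ = 0;  // slots used in the newest chunk
public:
    T* allocate() {  // returned pointers are never invalidated
        if (chunks_.empty() || used_ == ChunkSize) {
            chunks_.push_back(std::make_unique<T[]>(ChunkSize));
            used_ = 0;
        }
        return &chunks_.back()[used_++];
    }
};

int main() {
    StableArena<int> arena;
    int* a = arena.allocate();
    *a = 42;  // 'a' stays valid no matter how much is allocated afterwards
    return 0;
}
```

The stability guarantee is the point: new tree versions only append, and old versions stay readable until the whole arena is dropped.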

Kernel bypass is tricky, but DPDK (for networking) and SPDK (for storage) do similar things - maybe some concepts from those can apply? Also, if you’re messing with async buffer I/O, io_uring on Linux might be worth a dive for reducing syscall overhead.
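For reference, the basic io_uring flow looks roughly like this (Linux only, link with -luring; error handling trimmed for brevity):

```cpp
// One read submitted and completed through io_uring: a single submit syscall
// can cover many queued operations, which is where the overhead savings live.
#include <fcntl.h>
#include <liburing.h>
#include <cstdio>

int main() {
    io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) return 1;

    int fd = open("/etc/hostname", O_RDONLY);  // any readable file will do
    char buf[256] = {};

    io_uring_sqe* sqe = io_uring_get_sqe(&ring);          // grab a submission slot
    io_uring_prep_read(sqe, fd, buf, sizeof(buf) - 1, 0); // queue the read
    io_uring_submit(&ring);                               // one syscall

    io_uring_cqe* cqe;
    io_uring_wait_cqe(&ring, &cqe);                       // reap the completion
    std::printf("read %d bytes: %s", cqe->res, buf);
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}
```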

And writing your own C++ compiler and linker? Absolute madman. If you’re rolling your own STL replacement, check out EASTL (EA’s STL) for ideas on cache-friendly containers. Also, since you’re pushing compiler efficiency, maybe take a look at Cranelift - it’s built for fast JIT compilation and could give some insights into codegen efficiency.
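On the cache-friendly container point: the core trick in something like EASTL's fixed_vector is inline storage; elements live inside the object itself, so there is no heap allocation and iteration stays within the object's own cache lines. A stripped-down sketch of the idea (not EASTL's actual code):

```cpp
// Fixed-capacity vector with inline storage: no heap, no pointer chasing.
#include <cassert>
#include <cstddef>
#include <new>

template <typename T, std::size_t Capacity>
class FixedVector {
    alignas(T) unsigned char storage_[sizeof(T) * Capacity];
    std::size_t size_ = 0;
public:
    ~FixedVector() {
        for (std::size_t i = 0; i < size_; ++i) (*this)[i].~T();
    }
    void push_back(const T& v) {
        assert(size_ < Capacity);                 // fixed capacity by design
        new (storage_ + sizeof(T) * size_) T(v);  // construct in place
        ++size_;
    }
    T& operator[](std::size_t i) {
        return *std::launder(reinterpret_cast<T*>(storage_ + sizeof(T) * i));
    }
    std::size_t size() const { return size_; }
};

int main() {
    FixedVector<int, 8> v;
    v.push_back(1);
    v.push_back(2);
    return v[0] + v[1] == 3 ? 0 : 1;
}
```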

Tbh, sounds like you’re sitting on enough material for a serious book on high-performance game engine design ^_^

This is all keyword mumbo jumbo. What exact code/algorithm are you trying to optimize?

once completed I will start on my own c++ compiler

NBA2K, Madden, Maneater, Killing Floor, Sims
