ECS Architecture Efficiency

17 comments, last by Shannon Barber 7 years, 9 months ago

I was wondering, for all the C++ programmers out there, what kind of efficiency you're getting from an ECS (Entity Component System) game architecture. I'm wondering about max cycles: how many entities and corresponding components you can create in your system before UPS and FPS drop below the standard 60.

This here.... is significantly different than....

Ok, let me tackle this from a different perspective.... (No, I'm not an experienced programmer in C++; I just learned it a couple of months ago.... still learning.)

Say I have multiple vectors of custom types like so:


#include <memory>
#include <vector>

class Base {
public:
    int id;  // public so the loops below can compare IDs
    Base(int ID) : id(ID) {}
};

class A : public Base {
public:
    int x;
    A(int ID) : Base(ID) {}
};

class B : public Base {
public:
    int y;
    B(int ID) : Base(ID) {}
};

class Controller {
public:
    std::vector<std::shared_ptr<A>> AList;
    std::vector<std::shared_ptr<B>> BList;
};

Then I have a class that's going to iterate over the 2, because A has info B needs to operate:


class DoSomething {
private:
    std::shared_ptr<Controller> MasterController = std::make_shared<Controller>();

public:
    DoSomething() {
        for (int i = 0; i < 500; i++) {
            MasterController->AList.emplace_back(std::make_shared<A>(i));
        }
        for (int i = 0; i < 500; i++) {
            MasterController->BList.emplace_back(std::make_shared<B>(i));
        }
    }

    void Running_DoSomething() {
        for (auto itA = MasterController->AList.begin(); itA != MasterController->AList.end(); ++itA) {
            for (auto itB = MasterController->BList.begin(); itB != MasterController->BList.end(); ++itB) {
                if ((*itB)->id == (*itA)->id) {
                    // ...do something
                }
            }
        }
    }
};

Let's say Running_DoSomething is called 60 times per second. Iterating over one vector has very little effect on the FPS, but when I iterate over a 2nd one inside it, the FPS drops to around 8 per second. Why is iterating over multiple vectors like this so slow, and how can I correct it to prevent the drop in max cycles (where max cycles is a measure of how many total potential cycles of the game loop occur each second)?

Before you say that this is the wrong forum, please remember that the more specific your question is, the more useful the answers become.

But as hodge had mentioned, your big-O complexity becomes O(N^2) instead of O(N).

N being the number of items that you are processing: the higher it is, the worse performance becomes. Mind you, big O is an upper bound on how the work grows with N, and constant factors are usually dropped.

If you're trying to improve efficiency, it's usually recommended to provide some means of bringing the big O back down to O(N), which is what a normal ECS achieves, or to anything with better scaling.

So lets try refactoring your code.

You already know the IDs of the objects in your first vector, but then you're testing those IDs against everything in your second vector. That is terribly redundant.

The first option is usually to combine your data, which brings the big O back down to O(N).

The second is to limit how much data you traverse. For this one, you keep your first vector, but store everything from the second vector in a map and look entries up by ID. A hash map has an average-case lookup of O(1).

As you traverse the vector, you take each element's ID and ask the map for it. The map hands you back a pointer you can use to do your work. And since I'm going on a whim and guessing you're matching multiple objects per ID, your map would be a map of linked lists or vectors.
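To make that concrete, here is a minimal sketch of the map-based lookup, using simplified stand-ins for the A and B classes from earlier (the struct layout and the MatchAndCopy helper are my own invention for illustration, not the original poster's code):

```cpp
#include <cstddef>
#include <memory>
#include <unordered_map>
#include <vector>

// Minimal stand-ins for the A/B classes above (IDs public for brevity).
struct A { int id; int x; };
struct B { int id; int y; };

// Copies x from each A into the matching B; returns how many matches were found.
std::size_t MatchAndCopy(const std::vector<std::shared_ptr<A>>& AList,
                         std::vector<std::shared_ptr<B>>& BList) {
    // Build the index once: O(N) work, average O(1) lookups afterwards.
    std::unordered_map<int, B*> BIndex;
    for (auto& b : BList) BIndex[b->id] = b.get();

    // One pass over AList replaces the nested O(N^2) loop.
    std::size_t matches = 0;
    for (const auto& a : AList) {
        auto it = BIndex.find(a->id);
        if (it != BIndex.end()) {
            it->second->y = a->x;  // ...do something with the matching B
            ++matches;
        }
    }
    return matches;
}
```

With 500 elements per vector this does roughly 1,000 map operations instead of 250,000 ID comparisons.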


There are 1000 ms in a second, so to hold 60 FPS no frame can take more than about 16 ms, and you have to spread those 16 ms across all your systems. Obviously that is pretty difficult, so you need to come up with ways to only do work when absolutely necessary. I have found that a general Update function on a system kills frame rates pretty quickly, so I stick with a message queue and only run a system's Update function under two conditions: 1) a new frame was rendered, AND 2) the system REQUIRES an Update function over and above the message system. Of all the systems I have so far, only the tilemap renderer and minimap renderer do any work in their Update functions; everything else waits for a request.

And, this is something I *just* found out, but since ECS isn't good for GUIs you need to remember that some of that time will be spent on your GUI system as well...

So, in your ECS, the number of entities shouldn't affect your frame rate, just the quantity of work they all request.

Also, I am not 100% certain why, but my minimap rendering takes longer than my main tilemap rendering... but since it is much less important, I target 4 FPS rather than 60 on the minimap.

I have found that a general Update function on a system kills frame rates pretty quickly


I have never found a game where I didn't need an Update function running every frame. If you've attempted to remove that, then I expect you're in for a hard time.

However, there doesn't have to be 1 logic frame for every 1 rendering frame.

all the systems I have so far only the tilemap renderer and minimap renderer do any work in their Update functions, everything else waits for a request

If you have a ton of tiny components each with an empty Update function then I can imagine that would waste some time, certainly. Maybe you need a publisher/subscriber system that notes whether a component has an overridden Update call or not.

@Kylotan: I may have misspoken...

My system has Entities (A wrapper around an integer), Components (structs with only data members), and Systems...

Systems are a subclass of SystemBase which contains an Update and ProcessRequest function.

The Gamestate has a SystemManager... it has

<Pseudocode>

ProcessRequests {
    request = Requests.Pop()
    foreach system
        if (system.ProcessRequest(request)) break
}

Update {
    foreach system
        system.Update()
}

The main game loop looks like:

<Pseudocode again>

do {
    Gamestate.SystemManager.ProcessRequests(Request);
    Gamestate.SystemManager.Update();
}

A request contains: 1) the request type (an enumeration of things such as AIRequest, PhysicsRequest, etc.), 2) the ID of the requesting entity, and 3) an object representing additional request data, which differs depending on the request type.

So the majority of the systems do their work in the ProcessRequest function... so they don't iterate over any entities or components and only deal with the Request.Requestor entity.
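A rough C++ rendering of the request layout described above might look like this (the payload structs and the std::variant choice are my assumptions, not the poster's actual code):

```cpp
#include <variant>

using EntityId = int;

// An enumeration like the one described above.
enum class RequestType { AIRequest, PhysicsRequest, RenderRequest };

// Hypothetical payloads; std::variant stands in for "an object that
// differs depending on request type".
struct AIData      { EntityId targetEntity; };
struct PhysicsData { float impulseX, impulseY; };
struct RenderData  { int spriteId; };

struct Request {
    RequestType type;       // 1) the request type
    EntityId    requestor;  // 2) the ID of the requesting entity
    std::variant<AIData, PhysicsData, RenderData> data;  // 3) type-specific data
};
```

A system's ProcessRequest would check the type (or std::holds_alternative), handle only the requests it recognizes, and return true to stop the dispatch loop.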

The update function is only used on systems like the Tilerenderer/Minimap where I can already know which entities *Would have* sent a request...

I.e., since the tile renderer already knows the screen size, it already knows that it will draw cells (a,b) to (m,n), so there is no need for those entities to send a render request.

Neither systems nor entities contain the components; the components are all in their own containers, which are a bunch of hashmaps from EntityId -> Component.
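A minimal sketch of that storage scheme, with hypothetical Position/Velocity components and an Integrate system showing how two maps get joined by entity ID:

```cpp
#include <unordered_map>

using EntityId = int;

// Components are plain data structs, as described above (names are hypothetical).
struct Position { float x, y; };
struct Velocity { float dx, dy; };

// Each component type lives in its own container keyed by entity ID;
// neither entities nor systems own the components themselves.
struct ComponentStore {
    std::unordered_map<EntityId, Position> positions;
    std::unordered_map<EntityId, Velocity> velocities;
};

// A movement system joins the two maps: iterate one, probe the other by ID.
void Integrate(ComponentStore& store, float dt) {
    for (auto& [id, vel] : store.velocities) {
        auto it = store.positions.find(id);
        if (it != store.positions.end()) {
            it->second.x += vel.dx * dt;
            it->second.y += vel.dy * dt;
        }
    }
}
```

Entities that lack one of the two components are simply skipped, which is what makes this layout composable.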

I don't think you misspoke, you're just suggesting that a regular Update function doesn't work well, when in fact thousands of games have shipped with one. If Unity games can put an Update call on every MonoBehavior component, there's no reason why your game should be too slow just having such an update for each system.

Sorry, I'll try to be more specific in future posts, but essentially I'm looking at the ability to render as many entities as possible to the screen before the frame rate drops below 60 FPS.

@Tangletail I'm aware that those 2 are very different; I did mention in the second example that I was going to try to approach the question differently....

@Paragon123 Spread that 16ms out to every system??? Are you trying to run every system at 60FPS? That's wasteful and pointless.... Your rendering should be the only system operating that fast; all the other systems that update data can be run as slowly as 15 times per second and even a 3D MMO would run just fine. In fact, most modern MMO clients operate at 15-25 updates per second, whereas their servers operate at 10-15 updates per second. Not to mention there's the option to run your rendering on a separate thread asynchronously, which most games don't even do anymore.

You're doing what Space Engineers is doing wrong: trying to run your updates at the same semi-fixed rate as your renders. This will only create more headaches and give you fewer resources to work with. Also, I have already designed a system for running system updates only when necessary, based on a message queue, utilizing a cause & effect model I designed years ago. But rendering occurs regardless of other systems. And I've never had issues with GUIs in ECS models.

I've actually made a great deal of improvements on my original engine since this, and after testing:

My first version rendered up to 75 entities simultaneously to the screen before dropping below 60 FPS.

My second version rendered up to 2,300 entities simultaneously to the screen before dropping below 60 FPS.

And my third version rendered up to 12,000 entities simultaneously to the screen before dropping below 60 FPS.

Still looking for more improvements that will make this even more efficient. But my question is actually really simple, how fast are your engines running and what specific design patterns are you using to improve how many simultaneous entities can exist and be drawn without slowing your engine down. I'm trying to find a goal to shoot for and new ideas to improve my own engine.

I'm looking at the ability to render as many entities as possible to the screen before it drops the frame rate below 60 FPS.

Honestly, you're going to run into a ton of actual rendering-related bottlenecks before a decently architected game object/entity/"whatever buzzword you want to use" system gets in your way. Don't start theoretically optimizing things. Solve the problems you have at hand to get to the desired frame time for your specific game.

Spread that 16ms out to every system??? Are you trying to run every system at 60FPS? That's wasteful and pointless....

You've obviously never worked on any type of shooter before. Try telling that to people designing any type of competitive game which supports split-second inputs. 16ms in a game update tick can definitely make a difference.

Not to mention there's the option to run your rendering on a separate thread asynchronously, which most games don't even do anymore.

What are you talking about? There are tons of games doing async rendering. The era of single-threaded update and render is over!

Still looking for more improvements that will make this even more efficient. But my question is actually really simple, how fast are your engines running and what specific design patterns are you using to improve how many simultaneous entities can exist and be drawn without slowing your engine down. I'm trying to find a goal to shoot for and new ideas to improve my own engine.

This is still a very useless comparison to make. You're acting like every engine is comparable in some shape or form when it comes to performance. They're not. Focus on what your requirements and goals are. Do some profiling, find your bottlenecks, fix them. Rinse and repeat.

I gets all your texture budgets!

Based on all these updates, I'm taking the broader question to be "How many items can I stuff in an ECS game system before it slows down?"

I've worked on systems with well over five thousand articulated models on screen, with fully simulated game objects, before the processing power started to bog down. I've been brought in on contract to help a project with under 200 static models on screen that could barely maintain 30 frames per second on mainstream hardware. And I've worked on about 20 projects that have ranged far between.

The choice to use an ECS game system has absolutely nothing to do with those performance numbers.

BEGIN TEACHING MODE:

The biggest determining factor in performance is how you use your time.

The Steam Hardware Survey says about half of gamers today (46.91%) still have 2 physical cores, at about 2.4 GHz. So if you're targeting mainstream hardware, you get about five billion cycles per second if you use them all. Each cycle takes about 0.41 nanoseconds, but we'll call it half a nanosecond for easier math.

You lose a big chunk of that to the operating system and other programs. Let's call your share about 4 billion per second, or about 66 million processor cycles per frame. What you do with those cycles is up to you and your game.

Some tasks are extremely efficient, others are terribly inefficient. Some tasks are fast and others are slow. Some tasks block processing until they are done; others can be "fire-and-forget", scheduled for whenever is convenient for the processor. Sometimes even doing what appears to be exactly the same thing can in fact involve radically different work under the hood, giving very different performance numbers for reasons you never thought about.

The most frequent performance factor, and usually the easiest to address, is the algorithm chosen to do a job.

There are algorithms that are extremely efficient and algorithms that are inefficient. As an example, when sorting a random collection of values the bubblesort algorithm is very easy to understand but will be slow. The quicksort algorithm is harder to understand but will typically be fast. And there are some more sorting routines out there like introsort that are quite a bit more difficult to implement correctly but can be faster still.
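For reference, bubblesort really is only a few lines, which is why it is so easy to understand; the sketch below (my own illustration) shows the easy-but-slow option next to the standard library's fast one:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Bubblesort: easy to understand, O(N^2) comparisons, slow on large inputs.
void BubbleSort(std::vector<int>& v) {
    for (std::size_t pass = 0; pass + 1 < v.size(); ++pass)
        for (std::size_t i = 0; i + 1 < v.size() - pass; ++i)
            if (v[i] > v[i + 1]) std::swap(v[i], v[i + 1]);
}
```

In real code you would call std::sort(v.begin(), v.end()), which is O(N log N) and is typically implemented as an introsort variant in mainstream standard libraries.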

You can choose to use a compute-heavy algorithm when the program is run, or you can change the algorithm to use some data processing at build time in exchange for near-instant processing or precomputed values at runtime. Swap the algorithm to bring the time to nothing, or nearly so. For example, rather than computing all the lighting and shadowing for a scene continuously, an engine may "bake" all or most of the lighting and shadowing directly into the world.

You can often choose to switch between compute time and compute space, similar to that above. Precomputed values and lookup tables are quite common. In graphics systems it is fairly common to encode all the computing information into a single texture, then replace the compute algorithm with a texture coordinate for lookup. Textures for spherical harmonics are commonplace these days; even if artists don't know the math behind them many can tell you how "SH Maps" work and that they improve performance.
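As a toy example of trading compute time for space (my own illustration, not from the thread): a precomputed sine table replaces a relatively expensive std::sin call with an array index at runtime:

```cpp
#include <array>
#include <cmath>

constexpr int kTableSize = 256;
constexpr double kTwoPi = 6.28318530717958647692;

// Pay the std::sin cost once, up front, for every table slot.
std::array<float, kTableSize> MakeSinTable() {
    std::array<float, kTableSize> t{};
    for (int i = 0; i < kTableSize; ++i)
        t[i] = static_cast<float>(std::sin(kTwoPi * i / kTableSize));
    return t;
}

// Runtime "sin" is now just an index into the table (coarse, but nearly free).
float FastSin(const std::array<float, kTableSize>& table, float radians) {
    int i = static_cast<int>(radians / kTwoPi * kTableSize) % kTableSize;
    if (i < 0) i += kTableSize;  // handle negative angles
    return table[i];
}
```

The table costs 1 KB of memory and some precision; whether that trade is worth it depends on how hot the call site is.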

Sometimes it is clear to see places with multiply nested loops, places with exponential computational requirements, code that has known-slow algorithms with known-fast alternatives. And of course, the fastest work is the work that is never done.

So you may have an algorithm in place that has N^3 growth. With 5 items it may take 60 nanoseconds, and that's great. With 10 items it may take 500 nanoseconds; that's fine. With 100 items it takes 500,000 nanoseconds, and that is not fine. Swap in an algorithm that takes a bit more time per value but scales linearly, and those times might become 180 ns, 375 ns, and 3,750 ns, all of which are great.

Algorithm performance sometimes may be reviewed in the source, but other times they may require analysis tools and profiling.

After algorithm selection, one of the biggest performance factors in games is data locality. It has very little to do with ECS, although some ECS decisions can have a major impact on it.

Basic arithmetic from data already available to the CPU can be done quickly. Processor design allows multiple operations to take place at the same time in internal parallel processing ports, so a single basic arithmetic operation can take place in about one-third of a CPU cycle, or about 0.15ns per operation. If you are using SIMD operations and the CPU can schedule them on ports in parallel, it can take one-sixteenth of a CPU cycle per value, or about 0.03ns. Those are amazingly fast, and that is why so many programmers talk about ways to leverage SIMD operations, which you might have heard of under the name MMX, SSE, or similar.

But there aren't many registers and L1 cache lines on the processor, and reading from memory is slow. If the data is in L2 cache there is an overhead of about 7ns or about 20 cpu cycles. If the value is in main memory it takes about 100ns or about 240 CPU cycles.

Cache misses (needing to get something from farther away in memory) and cache eviction (not using what is already in the cache) can completely destroy a game's performance. Jumping around all over memory might not be a bad thing; what matters is cache behavior. If you are jumping around on data that is all in the chip's L1 cache it is amazingly fast; jump around on data in the L2 cache and performance drops by a factor of about 100; jump around on data requiring loads from main memory and performance drops by a factor of about 10,000.

ECS systems tend to jump around frequently, but design of the systems can mean it is jumping all over in L2 cache or jumping all over in main memory. It is a fairly minor design change but it makes about a 10x performance difference.

Two systems that look exactly the same can differ by an order of magnitude in performance based on data locality. Even the same system can suddenly seem to switch gears from fast to slow when data locality changes. You cannot spot the differences in data locality performance by reading the source code alone.
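The point that source code hides locality can be illustrated with two loops that compute the same answer over very different memory layouts (hypothetical Particle type; the timing difference only shows up under a profiler, not in the code):

```cpp
#include <memory>
#include <vector>

// A hypothetical game object; small enough that many fit in one cache line.
struct Particle { float x, y, energy; };

// Cache-friendly: particles sit contiguously, so iteration streams
// straight through memory and the hardware prefetcher can keep up.
float TotalEnergyContiguous(const std::vector<Particle>& ps) {
    float total = 0.0f;
    for (const Particle& p : ps) total += p.energy;
    return total;
}

// Cache-hostile: each particle is its own heap allocation, so every
// iteration chases a pointer to a potentially far-away address.
float TotalEnergyScattered(const std::vector<std::unique_ptr<Particle>>& ps) {
    float total = 0.0f;
    for (const auto& p : ps) total += p->energy;
    return total;
}
```

Both functions return the same value; only the layout of the data they walk differs, and that layout, not the loop, decides the cache behavior.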

Another major performance factor in games is how you move data around.

You need to move data between main memory and your CPU, and between both of them and your graphics card, your sound card, your network card, and other systems. System bus performance depends quite a lot on the hardware. Cheap motherboards and bad chipsets can move very little data at a time and have slow transfer rates. Quality motherboards and good chipsets can move tremendous amounts of data at a time with rapid transfer rates.

While you probably cannot control the hardware, if you know what you are doing you can coordinate how data moves around.

You can send data around from system to system all the time with no thought or regard for size or system effect. This is much like the highway system: sometimes you have near-vacant roads and can travel quickly, other times you'll have tons of cars saturating the road with all the vehicles sitting at a standstill. Your data will eventually get there, but the performance time will be unpredictable and sometimes terrible.

You can take steps to bundle transfers together and take simple steps to ensure systems don't block each other. This is much like freight trains: huge bundles with cars extending for one or two miles. There is some overhead, but they are efficient.

Or you can take more extreme methods to highly coordinate all your systems and ensure that every system is both properly bundled and carefully scheduled. This is like mixing the capacity of long freight trains with the speed of bullet trains: enormous throughput, low latency, and everything gets moved directly to the destination with maximum efficiency.

Like memory performance, you cannot spot the differences in bus usage performance by reading the source code alone.

There are many more, but those are normally the biggest impact.

These factors by themselves will account for the vast majority of the performance characteristics of an engine. A few minor differences in each of those things mean the difference between a game running at 10 frames per second or running at 100+ frames per second.

Let's say the overhead is an astounding 500 MHz.

Does the rest of your game not fit in the remaining 3.5 GHz? 7.5 GHz? 15.5 GHz?

- The trade-off between price and quality does not exist in Japan. Rather, the idea that high quality brings on cost reduction is widely accepted.-- Tajima & Matsubara

This topic is closed to new replies.
