What kind of coding standard do you prefer

Started by
82 comments, last by JoeJ 1 year, 4 months ago

Juliean said:
But, since we are talking about the collision-mask thing. What would that really mean? I give each component a “type” (uint8_t), and then have a mask of what collides with what. So then I'd do a bucket-sort, and only collide elements from other buckets that both “type”s agree they should collide with? If so (purely out of interest), how would one combine this with a tree-like structure? Do you just ignore this potential optimization of not even having to check elements at all because they would ignore each other - or do you do something else, like bucket-sort for each tree-node?

Hmm… i have such case for my GI. The BVH is a spatial tree. But the surfels are also grouped by a convex patch of surface. Which is useful because surfels on the same patch can't see each other, so i can spare some visibility rays.
But that's not the same… i realize convex clusters are close spatially, while collision masks are not.

So… basically the mask would cause new dimensions, beside the spatial ones. So you would need to build a tree using a multidimensional sort key? Probably that's the only way. But i'm actually sure it would be better to ignore masks for the collision broadphase, so yes - in other words: Ignore the potential optimization, and check masks only before creating a collision pair.

Now that i have seen for an object count of 100, which you consider to be already high, spatial optimizations are pointless, maybe binning by mask alone would be an actual improvement. Depends on the cost of the bucket sort, and if you would need to do this per frame or only when creating a new object.

On the other hand, and to address your unfounded doubts about trees, : ) notice that the trivial traversal per object i have used above is not the fastest option, just the simplest one. I would even still call this ‘brute force’.
To do better, we can replace the idea to collide an object against the tree, with the idea to intersect the tree with itself. This allows to share traversal work over whole tree branches and so many objects.
Related algorithms are no longer simple, and the memory management to support the caching of multiple traversal stack fragments is challenging, but that's the point where the super powers of trees really start to show \:o/ I tell you - it's mind blowing! ; )

But i have not learned such things from CS books (which i never did read), or other education like research papers (which i did read). Actually i never came across such algorithms in any literature. I assume they exist, and have a name, but idk.
I never had a reason to try this for a problem like collision detection, but i'm sure it would work. Not sure if it would help with N<100, though. But even if, the win would be too small to justify the effort.

Advertisement

JoeJ,

Can you try binary search next for the collision detection?

Idea is that for each dimension individually you do indexed radix sort by object size, and then indexed radix sort by object's starting position. So for two dimensions you'll have two list of indices to objects - that is the acceleration structure.

When doing collision detection for some object you pick that dimension that has the smallest intersection with total bounding volume. When you find your object's pos in the index list then you iterate nearby objects by indexes both left and right until neighbors are in range and do collision detection.

The reason I suggest this approach is because it is simple to keep acceleration structure up to date - just by running two radix sorts that most of the time won't update a thing because objects won't move much. I guess grid structure would be easier to keep up to date but depending how big the space is grid structure is not suitable for many cases. So, another advantage of this method could be that it can handle objects that are highly disproportional in size (big spaceships/small spaceships; big triangles/small triangles in mesh boolean operations)(but then you'll have to track overlapping ranges for each object in the list etc - sorting by end position then sorting by start position and track the range etc).

Ivica Kolic said:
Idea is that for each dimension individually you do indexed radix sort by object size, and then indexed radix sort by object's starting position. So for two dimensions you'll have two list of indices to objects - that is the acceleration structure. When doing collision detection for some object you pick that dimension that has the smallest intersection with total bounding volume. When you find your object's pos in the index list then you iterate nearby objects by indexes both left and right until neighbors are in range and do collision detection.

I think i saw such algorithm in the physics engine i'm using, but not sure. Is it what's called the ‘sweep and prune’ method maybe?

The thing i do not understand about it is: Say we pick the x dimension, and get a range of objects covering the whole y dimension.
How can we take benefit from the sorting in y to reduce this range further?

Or is that just not done, so we're using basically only a 1D acceleration structure?

Idk how radix sort precisely works, and i have issues with understanding ‘sort by size first, then by position’, so i can't add such comparison quickly. But sounds interesting.

Maybe you could add it? My code examples have no dependencies beside a vec2 class, so you could just paste it in.

You still owe me an explanation about ‘make GTA world procedurally with zero effort, and then it fits on a floppy disk’ claims! I must know… ;D

JoeJ said:
You see, not even at 100 objects brute force wins. The timing fluctuations are not as bad as i said, only about 5 ns.

Now since we talked a lot here, I don't remember how exactly I put it with the brute-force originally. But, the results do not entirely suprise me, or are really that much of a counter to what point I was making ?

The O(N^2) I put mainly of an example for simplicity, and what to do when you don't need better approaches. And your results kind of confirm that. Sure, 100 objects are low compared to other use cases, but I'd argue there are enough games out there, special in the indie-scene, that might even have less. And yet, brute-force in your case with those numbers still only runs marginally slower than the other options. Which means that, if we were to half objects, it might win, or at the very least, we are coming into a range where both measuring and the actual time the algorithms take won't matter at all. But I did never expect brute-force N^2 to win at a count of >100 objects, it might have come across like that. N^2 probably isn't as good for speculative execution as well, compared to linear vs binary-search.

But even still, I do think its crazy in a certain regard. Think about it. O(N^2) for 100 objects is 10000, with halfing the comparisons its about 5000 checks. O(N) is only 100. So you are doing 1/100th or 1/50th of the “steps” required for brute force, and yet the speedup at those numbers is only few %. That at least proofs part of my point. If I weren't right on a ground-level, going from O(N^2) to O(N) even at those object-counts should yield a speedup of up to >100x - but you are paying quite a steep cost for building those structures and evaluating them, that is somewhat in the range of doing 4900 additional iterations/checks. Ofc, if you start increasing the object-numbers even higher it becomes more clear-cut towards O(N), but we already agreed on that ?

JoeJ said:
Very surprising to me: std::vector> slightly wins over the list even including the time to build the grid per frame, it seems.

See? Told you ? And btw, I think you do even better, for both cases, but especially for vectors.

You should actually try to reuse the memory. Mark the vector static in both cases, and clear it before executing. I think this is a realistic optimization. After the first frame, speeds will increase drastically. Especially for vector<vector>. You should only make sure to create a local-reference variable to the vector before you loop, as to not pay the cost of static-init checks and loads all the time. If you don't like this, at least you could try alloca-ing the outer “vector”, as you already know its size, and if you do not except enough grid-cells to blow the stack-frame.

Oh, and also std::vector is not optiomal for this task as well. IIRC, std::vector does an allocation in the ctor even if the vector is empty. So, on grid-resize, you are doing a lot of unnecessary initial allocs. Which is (one of the) reaon(s) that I use my own vector-implementation.

Also, but this point is only minor; your objects do seem to always have the “next” pointer built inside them. For the case of vector<vector<>> where you don't need it, removing it might make the collision-checks themselves faster, but this dependens on how objects are allocated and how the rest of the members are defined. If they are all newed separately, and “next” is at the end of the class, it shouldn't matter much. Its more a matter of the point I mentioned before - you are essentially hardcoding the fact that objects are being used in this grid in the base of the objects. Which means that you cannot pop the object into such a structure if it was not intended to - like, for an external user of a library. Or at least, if you didn't do that, you would pay a much higher cost, making vector<vector> even better in comparsion ?

JoeJ said:
Now that i have seen for an object count of 100, which you consider to be already high, spatial optimizations are pointless, maybe binning by mask alone would be an actual improvement. Depends on the cost of the bucket sort, and if you would need to do this per frame or only when creating a new object.

Its the highest the game will probably have in terms of raw colliable objects at the same from a pure gameplay-standpoint, and was enough to already make the shitty Rpg-Maker back when I was using it? There are some cases where object-counts are higher, for example the tilemap can easily have 500000k tiles on the same map, with multiple thousands on the screen at the same time - in the editor, even all could be visible at once! This is the solved via a geometry-shader, because those are the counts where even normal sprite-batching would be way too slow ? But I'd digress.

JoeJ said:

I think i saw such algorithm in the physics engine i'm using, but not sure. Is it what's called the ‘sweep and prune’ method maybe?

The thing i do not understand about it is: Say we pick the x dimension, and get a range of objects covering the whole y dimension.
How can we take benefit from the sorting in y to reduce this range further?

So, if x-size of the object is just 5% of the total X range and y-size is 100% of the total Y range then the obvious choice is X axis. In average case it will prune 95% of the total objects.

This should be good enough for indexed radix sort (It's my very old code not by todays standard but it should be fine):

template <typename DATA_TYPE, typename INDEX_TYPE>
void RadixSortIndexed(INDEX_TYPE* indices, const DATA_TYPE* elements, size_t elementNumber, bool correctNegative = true, INDEX_TYPE* swapIndices = nullptr)
{
    assert(sizeof(DATA_TYPE) % 2 == 0); // NOTE: it works only for KEY_TYPE that has even size (2, 4, 6, 8, 10, .. bytes). That way we don't have to preform last copy.
    static const size_t bucketCount = 1 << sizeof(unsigned char)* 8;
    static const size_t steps = sizeof(DATA_TYPE);

    for (INDEX_TYPE i = 0; i < (INDEX_TYPE)elementNumber; ++i) indices[i] = i;
    bool shouldDeleteSwapIndices = false;
    if (swapIndices == nullptr)
    {
        swapIndices = new INDEX_TYPE[elementNumber];
        shouldDeleteSwapIndices = true;
    }

    INDEX_TYPE* indexA = indices;
    INDEX_TYPE* indexB = swapIndices;

    int bucket[bucketCount];    // U bucket ubacujemo 
    for (auto s = 0; s<steps; ++s)
    {
        std::memset(bucket, 0, sizeof(bucket));
        for (size_t e = 0; e<elementNumber; ++e) ++bucket[((unsigned char*)&(elements[indexA[e]]))[s]];
        int lastBucketSize = bucket[0];
        bucket[0] = 0; // Nema offset.
        for (auto b = 1; b<bucketCount; ++b)
        {
            int curBucketSize = bucket[b];
            bucket[b] = bucket[b - 1] + lastBucketSize; // Prethodni bucket offset + broj elemenata prethodnog bucketa.
            lastBucketSize = curBucketSize;
        }
        // Now we swap sorted from bufferA to bufferB
        for (size_t e = 0; e<elementNumber; ++e) indexB[bucket[((unsigned char*)&(elements[indexA[e]]))[s]]++] = indexA[e];

        INDEX_TYPE* tmpIndex = indexB;
        indexB = indexA;
        indexA = tmpIndex;
    }

    assert(indices == indexA);

    if (correctNegative && elementNumber && elements[indices[elementNumber - 1]]<0)   // Ovdje treba swapati jer su negativni clanovi koji bi trebali biti na dnu zavrsili na vrhu.
    {
        if (elements[indices[0]] < 0) // Svi su elementi negativni pa ih trebamo obrnuti
        {
            int low = 0;
            int high = elementNumber - 1;
            while (low<high)
            {
                INDEX_TYPE tmp = indices[low];
                indices[low] = indices[high];
                indices[high] = tmp;
                ++low;
                --high;
            }
        }
        else
        {
            int border = elementNumber - 1;	// Detekrirajmo gdje poci
            while (elements[indices[border]]<0) --border; // Primjeti ! na pocetku.
            ++border;	// Ne zanima nas prvi pozitivni clan, vec prvi negativni...
            std::memcpy(&swapIndices[elementNumber - border], indices, border*sizeof(INDEX_TYPE));
            // Sada cemo rekreirati novi niz u destination polju (kojeg cemo naknadno prekopirati u src)
            border = elementNumber - border;	// Koliko ih moramo kopirati (negativnih)...
            INDEX_TYPE* src = &indices[elementNumber - 1];	// Sa kraja polja source
            INDEX_TYPE* dst = swapIndices;	// Na pocetak polja destination
            while (border)
            {
                *dst = *src;
                ++dst;
                --src;
                --border;
            }
            std::memcpy(indices, swapIndices, elementNumber*sizeof(INDEX_TYPE));	// Kopirajmo sve nazad u source
        }
    }
    if (shouldDeleteSwapIndices) delete[] swapIndices;
}

You use it like this (I really should add lambda to be used for sorting key):

std::vector<OBJECT> _objects; // they have bounding box
std::vector<int> _sortedObjects[3]; // For each dimension we have object indices sorted by start position. This is our acceleration structure 

// Tmp stuff to be persisted in a permanent memory context object just so we don't allocate these local vectors all the time
std::vector<float> objectKeys; // tmp vector (but reuse vector for every time) - I should ommit this and just use objects with lambda for specifying sorting key (which can be start position, end position or size).
std::vector<int> swapBuffer; // Nice to have persisted so that radix sort doesn't have to use its own memory allocations on each use

// Let's update accelleration structure:

objectKeys.reserve(_objects.size());
swapBuffer.resize(_objects.size()); // Usefull preallocated tmp memory

// For each dimension we will first sort by size (or maybe end position)
for (int dim = 0; dim <3; ++dim)
{
	// Sort by size:
	objectKeys.clear();
	for (auto& o : _objects) objectKeys.push_back(o.GetMax()[dim] - o.getMin()[dim]); // This should be lambda but...
	RadixSortIndexed<float, int>(_sortedObjects[dim].data(), objectKeys.data(), _objects.size(), true, swapBuffer.data());
	
	// Sort by start pos:
	objectKeys.clear();
	for (auto& o : _objects) objectKeys.push_back(o.GetMin()[dim]); // This should be lambda but...
	RadixSortIndexed<float, int>(_sortedObjects[dim].data(), objectKeys.data(), _objects.size(), true, swapBuffer.dat
}

EDIT: maybe sort by size isn't necessary - but it could be useful if start positions and dimensions are quantified.

Next goes code for finding best axis for our object's size, binary search to find its position in the _sortedObjects[choosenAxis] and going left and right and finding all the objects that are touching ours.

Note that objects are also sorted first by size so this can be very useful but additional data has to be persisted (like end pos reach for every object before) so that we can know when to terminate the search. This won't work that well if some object at the very beginning is huge and encompasses the entire space because to get it we'll have to do full traversal which is the worst case scenario (so maybe this data structure isn't the best for vastly different object sizes). But hey, knowing max object size should be good enough to know when to terminate the search.

JoeJ said:

You still owe me an explanation about ‘make GTA world procedurally with zero effort, and then it fits on a floppy disk’ claims! I must know… ;D

I can't confirm or deny anything about that subject but I can act like that is regular order of progression and none will be the wiser (if you catch my drift).

Juliean said:
Now since we talked a lot here, I don't remember how exactly I put it with the brute-force originally. But, the results do not entirely suprise me, or are really that much of a counter to what point I was making ?

Ja ja, i knew you would say this. Though, i can say the same. The tree started to win at lower counts than i had expected. And we agreed it's a matter of N anyway.

Still, i give an example to illustrate what i've initially meant:
My fluid simulator can do something like 60k particles at 60 fps. This includes modeling of collisions, but also viscosity and all that fluids stuff.
If i drain particles to a terrain, the collisions generate a distribution of particles which looks like Poisson Disk samples, which is visually pleasing. And their motion is like flocks, and easy to control.

So you could use that to model an army of soldiers. And your spells, rushing through the army, would cause a trail of corpses, forming nice persistent patterns of red blood from the magic bullet hell projectiles. Wouldn't that be a nice new experience?
I'm a bit joking, and personally i prefer quality over quantity, but there were some games which used this to get attention and finally success from such features. Likely those NPCs could not run an individual script for each, but still.

What i mean is: You either can use modern HW brute force power to do the same old things with less effort, or you can use it to try new things. Both is fine depending on your goals, but due to HW progress slowing down, the latter now becomes more difficult to achieve. And because brute force methods have been exhausted for everything, we won't discover new things anymore this way. So we need to change our minds, even if brute force seemingly is good enough at a current moment.

Likely i should express myself differently, avoiding that ‘brute force’ word, which is not really the point. It's rather my desperate search for new things that i mean, which has no real connection to compute power at all.
The most promising direction i see myself here is replacing character animation with simulation, which is not very expensive, but rather a difficult control problem. So what i really mean is encouraging complexity instead praising simplicity.

Juliean said:
You should actually try to reuse the memory. Mark the vector static in both cases, and clear it before executing.

I thought about that, but then i removed the ‘static’ because i had to clear the whole grid at the start. And if i clear the outer vector, the inner ones would be just deallocated, loosing the potential advantage?
So instead, i'd need to loop over the inner vectors, clearing each of them idividually, i guess.
But then i thought that's probably pointless, and hoped the reallocation likely can get the same memory somehow quicker than my manual fiddling. Would be interesting to try.

You should only make sure to create a local-reference variable to the vector before you loop, as to not pay the cost of static-init checks and loads all the time.

Seems you know better, but i do not understand. Can you make a code example?

Juliean said:
Also, but this point is only minor; your objects do seem to always have the “next” pointer built inside them.

Yeah, but for the comparison i could only store the next pointer in a separate vector. So to be really fair, i would have needed to write the whole code twice.

I'm questioning since a long time if using std::vector for performance critical code is a valid option nowadays. Seems the case, and i don't want to implement my own containers.
But i'm still doubtful. I could try to compare it against the containers of the physics engine…

Ivica Kolic said:
So, if x-size of the object is just 5% of the total X range and y-size is 100% of the total Y range then the obvious choice is X axis. In average case it will prune 95% of the total objects.

Yeah, i remember i thought about such algorithm >20 years ago, when i worked on physics myself. But i rejected the idea because of the 1D limitation, which i did not like back then.
But i will try it out, maybe later in the evening…

Ivica Kolic said:
I can't confirm or deny anything about that subject but I can act like that is regular order of progression and none will be the wiser (if you catch my drift).

Ofc. i don't expect you're 100% sure about the claims. It just sounds you believe in some idea or vision. That's always interesting to hear and discuss.
But you have mentioned it, so you also should give a minimal overview of the idea. No need to reveal secrets, in case you work on it yourself. I'm still all ears… : )

@Ivica Kolic

My coding doesn't really fit into your standard.

As some examples I couldn't get by with 2-byte indexes in most cases since that limits you to 65536 indexes. In fact, I don't use indexes at all in most cases. I use relative addressing half pointers which are integrated into my heap system. IMO some things work better with arrays/vectors than others. You still have to deal with a hole when you remove something. You can copy the last item to fill it, use a free list, collect holes later, etc. Each has ups and downs. Also new and delete can be very fast if you are not using the stock heap. In fact mine are mostly inlined. I could go on but there isn't much point to addressing everything since it would be a huge back a forth discussion, and everything varies with the exact application anyway.

My main point is there is more than one way to skin a cat. If your coding standard is working for your, then that's great. I'm certainly not saying your standard is bad. However, I don't think there is any globally best coding standard that everyone should stick to. IMO everything is situational.

JoeJ said:
So you could use that to model an army of soldiers. And your spells, rushing through the army, would cause a trail of corpses, forming nice persistent patterns of red blood from the magic bullet hell projectiles. Wouldn't that be a nice new experience? I'm a bit joking, and personally i prefer quality over quantity, but there were some games which used this to get attention and finally success from such features. Likely those NPCs could not run an individual script for each, but still.

Oh, definately. There are many cool examples, depending on the game. If I ever finish my 2d-rpg, I want to do something eigther like an RTS or 3D horror, so maybe then my stance changes. And also, as I said, for the case where I do have multiple millions of tiles I already have to come up with more “creative” solutions. Thats just my takeaway, that what optimization you chose not only depends on the type of problem but also in a large way the type of game/application you create in a broader sense ?

JoeJ said:
What i mean is: You either can use modern HW brute force power to do the same old things with less effort, or you can use it to try new things. Both is fine depending on your goals, but due to HW progress slowing down, the latter now becomes more difficult to achieve. And because brute force methods have been exhausted for everything, we won't discover new things anymore this way. So we need to change our minds, even if brute force seemingly is good enough at a current moment.

Yeah, that was one thing I wanted to say to that mindset. I find it similar to the discussion that sometimes pop up about using premade engine/libraries VS writing your own. If your sentiment here was true, that would mean that by people using Unity/Unreal, no new engines would be able to be made anymore, because nobody is doing it anymore. I don't think its a really valid point. You pick the tool for the job, if the tool is easy/simply so be it, if its complicated and hard then thats that. But I think I've already said that, so let me put it another way: Maybe instead of forcing innovation by using complicated methods for simple tasks that don't need them, we have innovation by people like you choosing task where “complicated" solutions are required. I think that is a better way to look at it, no? ?This would kind of be a middle ground where both our views can coexist ?
On second though, I'd say that this is probably the “right way” to go about it. Applying complex algorithms to simply problems won't really give you any insight into how those complex algorithms really perform, nor does it give you the ground to really have to push the limits. If my frametimes is twice as long it is allowed to be, I do have to come up with something, as all cost. If I save a few ns on some occassions but lose some on others, and its all in the “noise” level, then why bother? I could even end up with something worse just because there is so little to be gained and so little place to accurately measure.

JoeJ said:
Likely i should express myself differently, avoiding that ‘brute force’ word, which is not really the point. It's rather my desperate search for new things that i mean, which has no real connection to compute power at all.

Well, we did maybe mix up things in the discussion, here and there :D Do not be mistaken, I'm also driving to innovate, especially on the tool-front and on the performance of this tooling or the engine itself. Just that it requires different solutions/approaches, since its a different problem than what we were actually talking about here. To have a performant game-engine overall, including editor, its more about having a large-scale system that doesn't make operations unncessarily slow, without being able to “simply” improve O-complexity - because a lot of task already have O(1) (doing a complex task for a single asset), or O(N) (doing an operation for a cluster of assets). Thats what it often feels like with Unity, everything just takes 10x as long as it should be ?

JoeJ said:
I thought about that, but then i removed the ‘static’ because i had to clear the whole grid at the start. And if i clear the outer vector, the inner ones would be just deallocated, loosing the potential advantage? So instead, i'd need to loop over the inner vectors, clearing each of them idividually, i guess. But then i thought that's probably pointless, and hoped the reallocation likely can get the same memory somehow quicker than my manual fiddling. Would be interesting to try.

You should not clear the whole vector, but every cell. This might sound counter-intuitive at first, but it is really no overhead compared to your method. With the local vector, you have to do two tasks:

  • Create the vector. This allocates (dim*dim) memory, iterates that many times and constructs new vector each time, resulting in additional allocations (due to std::vector); but at least ctors being called
  • At the end of the frame/function, destroys the vector. This again requires iterating over the whole structure, destroying each individual vector and deallocating all the memory

If you made it static, this would change to (after the first frame):

  • Iterate over the entire vector. Clear every sub-vector, which for pointer-elements should just mean setting the “size” to 0

See? This is way less work without any additional strings attached, also removing all allocations. Thats why I harp so much on measuring and how performance isn't always intuitve, at least not on first glance ?

JoeJ said:
Seems you know better, but i do not understand. Can you make a code example?

When you access static variables, there is a certain overhead. Mainly, since c++11 added thread-safe statics, you have to first check (in a thread-safe way) if the static is initialized, before accessing it. The access itself is only marginally slower, but you still have to tap into some global memory, possibly adding another dereference/worse caching.

To apply my solution, here's what you'd conceptually do:

static std::vector<std::vector<Object*>> grid;

auto& localGrid = grid;

for (const auto* object : objects)
{
    if (...)
       localGrid[coordinate].push_back(object);
}

This optimization only makes sense when accessing a static variable multiple times within the same function, preferably in a large loop, but it does provide a noticable speedup. And as far as I recall, this is not an optimization that most compiled can/will do on their own, unfortunately :( But I'd have to double check.

JoeJ said:
I'm questioning since a long time if using std::vector for performance critical code is a valid option nowadays. Seems the case, and i don't want to implement my own containers. But i'm still doubtful. I could try to compare it against the containers of the physics engine…

I do think it is a valid option. Especially if you have algorithms where you don't need to complex spatial structures, but maybe just collect a bunch of objects in one container. There are things that could be improved - for example, I added an “uninitialized” resize mode that speeds up the case where you just want to memcpy primitive data to the vector; or a “push_back_reserved”, which can be used for the case where you reserved the appropriate amount of capacity and then push back (both as an optimization and an assert - gives improvements of about 2x in debug and 5-10% in release) - this case is very common in my codebase.
The worst part about std::vector is its bool-specialization, which was the main reason I came up with my own vector. I don't really regret it, since I don't often use (stateful) allocators, it also wasn't that hard.

Juliean said:
This optimization only makes sense when accessing a static variable multiple times within the same function, but it does provide a noticable speedup.

I would never had figured out this on my own, thanks for sharing! :D

Will add and and compare later, with a somewhat larger grid…

This topic is closed to new replies.

Advertisement