Optimizing skeletal animation system

6 comments, last by Armagedon 4 years ago

I just put together my first skeletal animation system, and IT WORKS!!! Which was exciting to see. However, I've noticed the performance is abysmal. After profiling, I found that for the skinned mesh I'm using to test, it takes between 3 and 4.5ms to fully update all of the bones in the armature…0_0. So I feel like I'm doing something wrong, as that seems like an awfully long time. Is there anything within my logic below that could be optimized to get this time down? Really curious now how engines like Unreal can manage so many concurrent animations at once.

Below is the relevant code, full source can be found here (can also add other chunks upon request): https://github.com/useless3d/useless3d/tree/skeletal_animation

The “updateCurrentAnimation” method is invoked from my primary loop:

glm::vec3 Armature::calcTranslation(const double& time, const usls::scene::animation::Channel& channel)
	{
		if (channel.positionKeys.size() == 1)
		{
			return channel.positionKeys[0].second;
		}

		size_t currentKeyIndex = 0;
		for (size_t i = 0; i < channel.positionKeys.size() - 1; i++)
		{
			if (time < channel.positionKeys[i + 1].first)
			{
				currentKeyIndex = i;
				break;
			}
		}

		size_t nextKeyIndex = currentKeyIndex + 1;

		double deltaTime = channel.positionKeys[nextKeyIndex].first - channel.positionKeys[currentKeyIndex].first;
		double factor = (time - channel.positionKeys[currentKeyIndex].first) / deltaTime;

		glm::vec3 start = channel.positionKeys[currentKeyIndex].second;
		glm::vec3 end = channel.positionKeys[nextKeyIndex].second;
		glm::vec3 delta = end - start;

		glm::vec3 returnVal = start + (float)factor * delta;

		return returnVal;
	}

	glm::quat Armature::calcRotation(const double& time, const usls::scene::animation::Channel& channel)
	{
		if (channel.rotationKeys.size() == 1)
		{
			return channel.rotationKeys[0].second;
		}

		size_t currentKeyIndex = 0;
		for (size_t i = 0; i < channel.rotationKeys.size() - 1; i++)
		{
			if (time < channel.rotationKeys[i + 1].first)
			{
				currentKeyIndex = i;
				break;
			}
		}

		size_t nextKeyIndex = currentKeyIndex + 1;

		double deltaTime = channel.rotationKeys[nextKeyIndex].first - channel.rotationKeys[currentKeyIndex].first;
		double factor = (time - channel.rotationKeys[currentKeyIndex].first) / deltaTime;

		glm::quat start = channel.rotationKeys[currentKeyIndex].second;
		glm::quat end = channel.rotationKeys[nextKeyIndex].second;
		glm::quat delta = glm::slerp(start, end, (float)factor);
		delta = glm::normalize(delta);

		return delta;
	}

	glm::vec3 Armature::calcScale(const double& time, const usls::scene::animation::Channel& channel)
	{
		if (channel.scalingKeys.size() == 1)
		{
			return channel.scalingKeys[0].second;
		}

		size_t currentKeyIndex = 0;
		for (size_t i = 0; i < channel.scalingKeys.size() - 1; i++)
		{
			if (time < channel.scalingKeys[i + 1].first)
			{
				currentKeyIndex = i;
				break;
			}
		}

		size_t nextKeyIndex = currentKeyIndex + 1;

		double deltaTime = channel.scalingKeys[nextKeyIndex].first - channel.scalingKeys[currentKeyIndex].first;
		double factor = (time - channel.scalingKeys[currentKeyIndex].first) / deltaTime;

		glm::vec3 start = channel.scalingKeys[currentKeyIndex].second;
		glm::vec3 end = channel.scalingKeys[nextKeyIndex].second;
		glm::vec3 delta = end - start;

		glm::vec3 returnVal = start + (float)factor * delta;

		return returnVal;
	}

	void Armature::updateBone(size_t index, double time, glm::mat4 parentMatrix)
	{
		auto& bone = this->bones[index];
		usls::scene::animation::Channel channel;

		for (auto& c : this->currentAnimation->channels)
		{
			if (c.name == bone.name)
			{
				channel = c;
				break;
			}
		}

		auto boneMatrix = glm::mat4(1.0f);
		boneMatrix = glm::translate(boneMatrix, this->calcTranslation(time, channel));
		boneMatrix = boneMatrix * glm::toMat4(this->calcRotation(time, channel));
		boneMatrix = glm::scale(boneMatrix, this->calcScale(time, channel));
		boneMatrix = parentMatrix * boneMatrix;

		bone.matrix = boneMatrix;

		for (auto& c : bone.children)
		{
			this->updateBone(c, time, boneMatrix);
		}
	}

	void Armature::updateCurrentAnimation(double runTime)
	{
		double timeInTicks = runTime * this->currentAnimation->tps;
		double animationTime = fmod(timeInTicks, this->currentAnimation->duration);

		this->updateBone(0, animationTime, this->transform.getMatrix());
	}

Modified animation class to store animation channels in an unordered map, which removes the need to loop through and do string comparison on (potentially) every channel in the animation for every bone. This increased performance by roughly 90%. Processing of animation now takes between 1.5ms and 3ms. So that helps, but many skinned meshes on screen at once is still far from a reality with those numbers.

I feel like there's some better method for processing this animation data that I'm just not aware of. My mind keeps trying to think of a way to use the GPU to do it, but I would need to retrieve the processed matrices to be used by logic on the CPU after they were computed…so I'm not sure how that would work, or if it would be feasible (maybe that's how these big AAA engines do it?).

Anyway, I'm kind of just rambling. If anyone has any insight on how I might speed this up I'm thirsty for links as this is one of those instances where I don't know what I don't know, so even just keywords that could point me in a direction are greatly appreciated.

Below is the updated “updateBone” method if anyone is interested:

void Armature::updateBone(size_t index, double time, glm::mat4 parentMatrix)
	{
		auto& bone = this->bones[index];
		auto& channel = this->currentAnimation->channels[bone.name];

		auto boneMatrix = glm::mat4(1.0f);
		boneMatrix = glm::translate(boneMatrix, this->calcTranslation(time, channel));
		boneMatrix = boneMatrix * glm::toMat4(this->calcRotation(time, channel));
		boneMatrix = glm::scale(boneMatrix, this->calcScale(time, channel));
		boneMatrix = parentMatrix * boneMatrix;

		bone.matrix = boneMatrix;

		for (auto& c : bone.children)
		{
			this->updateBone(c, time, boneMatrix);
		}
	}

After a good night's sleep, I think I may know how to improve this. Seems like I may just need to restructure my data so that I'm not burning up so many loops every frame just to find the data that needs to be processed.
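One way that kind of restructuring could look is sketched below: resolve the bone name to a channel index once, when the animation is assigned, so the per-frame update indexes a plain vector instead of hashing a string for every bone. This is only a sketch under assumptions, not the code from the repository; `buildBoneToChannel` and the idea of keeping channels in a vector alongside a name list are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: run once when an animation is assigned to an armature.
// result[boneIndex] is the index of the matching channel, or -1 if the
// animation has no channel for that bone.
std::vector<int> buildBoneToChannel(const std::vector<std::string>& boneNames,
                                    const std::vector<std::string>& channelNames)
{
    // Hash every channel name once, up front.
    std::unordered_map<std::string, int> channelIndexByName;
    for (std::size_t i = 0; i < channelNames.size(); ++i)
        channelIndexByName[channelNames[i]] = static_cast<int>(i);

    // Resolve each bone to its channel index (or leave it at -1).
    std::vector<int> boneToChannel(boneNames.size(), -1);
    for (std::size_t b = 0; b < boneNames.size(); ++b)
    {
        auto it = channelIndexByName.find(boneNames[b]);
        if (it != channelIndexByName.end())
            boneToChannel[b] = it->second;
    }
    return boneToChannel;
}
```

With a table like this, the per-frame code would only need something like `channels[boneToChannel[index]]`, and a value of -1 gives an explicit place to handle bones the animation does not touch.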

There are a few things that you can do:

  1. Your translation/rotation/scaling keys are sorted by time, so you can use std::upper_bound to find the required key, decreasing complexity from O(n) to O(log n).
  2. When interpolating keys, use glm::vec4 instead of glm::vec3. The latter is not SIMD optimized in GLM.
  3. Remove unnecessary assignments in the updateBone method. Your bone matrix is just:
    auto boneMatrix = parentMatrix * glm::translate(glm::mat4(1.0f), calcTranslation(time, channel)) * glm::toMat4(calcRotation(time, channel)) * glm::scale(glm::mat4(1.0f), calcScale(time, channel));
    Also make sure that the underlying code is compiling to SIMD intrinsics (I had a problem with this before; add #define GLM_FORCE_ALIGNED_GENTYPES to make sure).
  4. Change the skeleton bone layout. Instead of a tree layout, store the bones in a single array (or vector) in order of access. For example: Root → Torso → Left Arm → Left Hand → Right Arm → Right Hand etc.
    This will reduce the number of cache misses when iterating over the skeleton.
  5. Multithread your code. It's quite easy with OpenMP. Take all the meshes that you need to animate, calculate the animation for each, and store the results in a temporary array. Something like:
struct Bones
{
    Matrix BonesMatrices[100];
};

std::vector<Bones> animationMatrices;
animationMatrices.resize(meshes.size());

// Each iteration writes only to its own slot, so the loop can be split across threads
#pragma omp parallel for
for(int i = 0; i < (int)meshes.size(); i++)
{
   animationMatrices[i] = animation.GetAnimationMatrix(meshes[i], time);
}

//Use animationMatrices during rendering
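
To make suggestion 1 concrete, here is a minimal sketch of the key lookup using std::upper_bound. It assumes the key times are kept in their own sorted vector with strictly increasing values and at least two entries; `findKeyIndex` and `interpolationFactor` are hypothetical names, not functions from the original code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: index of the key at or before `time`, clamped so that
// index + 1 is always a valid "next" key.
// Assumes keyTimes is sorted ascending and holds at least two entries.
std::size_t findKeyIndex(const std::vector<double>& keyTimes, double time)
{
    assert(keyTimes.size() >= 2);
    // upper_bound returns the first key time strictly greater than `time`.
    auto it = std::upper_bound(keyTimes.begin(), keyTimes.end(), time);
    if (it == keyTimes.begin())
        return 0; // time lies before the first key
    std::size_t index = static_cast<std::size_t>(it - keyTimes.begin()) - 1;
    return std::min(index, keyTimes.size() - 2); // stay on the last segment
}

// Normalized position of `time` between key `index` and key `index + 1`.
// The parentheses matter: subtract first, then divide by the segment length.
double interpolationFactor(const std::vector<double>& keyTimes,
                           std::size_t index, double time)
{
    double deltaTime = keyTimes[index + 1] - keyTimes[index];
    double factor = (time - keyTimes[index]) / deltaTime;
    return std::max(0.0, std::min(1.0, factor)); // clamp to [0, 1]
}
```

The returned index and factor can then be fed into the existing lerp/slerp code in place of the linear search.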

Greatly appreciate the input! Trying these out now.

@armagedon I have implemented suggestion #1 and #4.
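
For anyone following along, the idea behind #4 can be sketched roughly as below. This is not the code from the repository: `Mat4` is a tiny stand-in for glm::mat4 so the example compiles on its own, and `FlatBone` and `updateWorldMatrices` are hypothetical names. It assumes the bones are sorted so that a parent always appears before its children.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal stand-in for glm::mat4 (row-major) to keep the sketch self-contained.
using Mat4 = std::array<float, 16>;

Mat4 multiply(const Mat4& a, const Mat4& b)
{
    Mat4 r{}; // zero-initialized
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            for (int k = 0; k < 4; ++k)
                r[row * 4 + col] += a[row * 4 + k] * b[k * 4 + col];
    return r;
}

// Hypothetical flat layout: parent == -1 marks the root bone.
struct FlatBone
{
    int  parent;
    Mat4 local; // interpolated translate * rotate * scale for this frame
    Mat4 world; // result: parent world * local
};

// One linear pass over contiguous memory replaces the recursive traversal.
void updateWorldMatrices(std::vector<FlatBone>& bones, const Mat4& rootTransform)
{
    for (std::size_t i = 0; i < bones.size(); ++i)
    {
        FlatBone& bone = bones[i];
        assert(bone.parent < static_cast<int>(i)); // parents must come first
        const Mat4& parentWorld =
            (bone.parent < 0) ? rootTransform : bones[bone.parent].world;
        bone.world = multiply(parentWorld, bone.local);
    }
}
```

Because every parent's world matrix is already computed by the time a child is reached, no recursion or per-bone child list is needed during the update.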

Regarding #2 this seemed to have a detrimental impact, but that may have been due to the way I was implementing it (constructing vec4s out of the vec3s, doing the calculation, then constructing a new vec3 to return from the calculated vec4)?:

glm::vec3 Armature::calcTranslation(const double& time, size_t currentKeyIndex, const usls::scene::animation::Channel& channel)
	{
		size_t nextKeyIndex = currentKeyIndex + 1;

		double deltaTime = channel.positionKeyTimes[nextKeyIndex] - channel.positionKeyTimes[currentKeyIndex];
		double factor = (time - channel.positionKeyTimes[currentKeyIndex]) / deltaTime;

		glm::vec4 start = glm::vec4(channel.positionKeyValues[currentKeyIndex], 0);
		glm::vec4 end = glm::vec4(channel.positionKeyValues[nextKeyIndex], 0);
		glm::vec4 delta = end - start;

		glm::vec4 returnVal = start + (float)factor * delta;

		return glm::vec3(returnVal.x, returnVal.y, returnVal.z);
	}

Regarding #3 I cleaned this up a little, but as the translate() and scale() methods require a reference to an existing matrix, I was unable to calculate the entire matrix with a single assignment:

bone.matrix = glm::mat4(1.0f);
bone.matrix = glm::translate(bone.matrix, this->calcTranslation(time, currentKeyIndex, channel));
bone.matrix = bone.matrix * glm::toMat4(this->calcRotation(time, currentKeyIndex, channel));
bone.matrix = glm::scale(bone.matrix, this->calcScale(time, currentKeyIndex, channel));
bone.matrix = parentMatrix * bone.matrix;

Regarding #5 Still looking into this. I have been building some simple test programs to ensure I have a decent understanding of how to work with threads in this manner.

NOW….To my recent revelation which resulted in a 4200% performance increase…RELEASE BUILDS!!!! I've been running under the Debug config and hadn't tested any of this with Visual Studio's Release build configuration. Wow. What a difference. These are more aligned with the results I was expecting to see. I can now place 100 skinned meshes (46 bones each) within a scene all running an animation, and still have 600fps left over. 0_0

Regarding #2: Just store glm::vec4 instead of glm::vec3 in positionKeyValues, so no conversion is needed at interpolation time.

Regarding Release build: always test your performance with a Release build (unless you need fast debug builds). The optimizer can pull off lots of inlining, autovectorization, etc., and most importantly, under the Visual Studio compiler, it eliminates lots of STL checks that hurt performance.

