Hi,
I was looking for a way to effectively interpolate between skeletons represented by list of bones with position vector + orientation quaternion pairs. To my surprise, I couldn't find any good solution, not to my taste that is. I can't use NLERP because I need a general solution, and NLERP is only good for small distance interpolations. SLERP would be perfect, but I couldn't find a decent implementation, only slow ones, so I spent a lot of time looking at game engines and quaternion libraries and reading academic articles on the topic. Finally, I decided to collect my findings and write my own optimized version of SLERP, which focuses on performance over precision, and which I now would like to share with you.
Optimized cross-platform SLERP
I've started from the well known formula, as implemented by most C++ libraries, and used all the tricks I could find on the net. Then I took it to the next level, and made some more adjustments on my own speeding up a little bit more. Finally I ended up with two solutions:
1. one that is cross-platfrom, ANSI C solution that is easily embeddable in C++ and compatible with any C / C++ quaternion class (provided their data can be seen as 4 floats), very simple, very minimal. Use it freely if you want, MIT licensed
2. a SIMD version, which I had to leave half-ready, because my compiler and Intel disagree on what intrinsics should be provided for SSE, moreover the compiler does not work the way its documentation say it should :-( Shit happens. I would only recommend this for experimental purposes, also MIT licensed
Both versions are simpler and much faster than any implementations I could find, designed to be called in a loop several times. The only dependency they have is lib math (acosf, sinf). The SIMD version is half-ready, but even in this form it's almost 1.5 as fast as the other one, especially if you call it in a loop with memory prefetch on the quaternions.
If I made any mistake, miscalculated or mistyped something, let me know. If anybody has an idea how to overcome that intrinsics blockade I have, that would be appreciated very much, because I can clearly see the path how to make the code several times faster, only if I could get rid of the library calls. I also plan to create an ARM NEON port once I have that.
Cheers,
bzt