
3 quick ways to calculate the square root in c++

Started by October 22, 2019 01:17 AM
37 comments, last by l0calh05t 5 years ago
17 hours ago, Juliean said:

The "static" member also does everything but give you speed. Under optimization, with or without it you get the same code (https://godbolt.org/z/ZBz6MM), because the compiler can figure out that you are not using the value of the union in subsequent function calls. But if it didn't, that static would make the code much slower: you save one adjustment of the stack pointer, but you pay the cost of accessing static storage, as well as (at least) a check on each access for thread-safe initialization.

100% this. Which is why I removed it from my version :) 
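To make the difference concrete, here is a sketch (my own, not from the thread) of the first function with the static union removed entirely in favor of `std::memcpy`, which also sidesteps the strict-aliasing UB of the pointer cast. `Sqrt1Local` is just an illustrative name:

```cpp
#include <cstdint>
#include <cstring>

// Same bit trick as the OP's Sqrt1, but with plain automatic locals
// instead of a static union. std::memcpy is the well-defined way to
// reinterpret the float's bits (no strict-aliasing violation).
float Sqrt1Local(float n)
{
    std::int32_t i;
    std::memcpy(&i, &n, sizeof(i));  // bits of n
    i = 0x5F375A86 - (i >> 1);       // magic-constant initial guess for 1/sqrt(n)
    float f;
    std::memcpy(&f, &i, sizeof(f));  // back to float: f ~ 1/sqrt(n)
    // One Newton-Raphson step, multiplied by n to get sqrt(n) ~ n * (1/sqrt(n))
    return (3.0f - n * f * f) * n * f * 0.5f;
}
```

With optimizations enabled, this should compile to essentially the same code as the union versions, but without relying on undefined behavior.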

Hi guys,

Since I created this post, I've only heard things I already knew, such as undefined behavior, static members, trap representations in C++, and modern compilers. I just want to see more efficient ways to calculate the square root. And since the creation of this post, nobody has posted any sample code. Bye.

11 hours ago, gustavo rincones said:

Since I created this post, I've only heard things I already knew, such as undefined behaviors, static members, trap representation in c ++ and modern compilers. I just want to see more efficient ways to calculate the square root.

I quickly reread everything without paying much attention to details, so sorry in advance if I missed something.

You shouldn't forget that this is a forum for game developers. A topic like yours is better suited for a computer science forum. As I wrote in my previous posts, I doubt that it is really relevant for most games, since the number of square root calculations is usually comparatively low. If it becomes relevant, the fast-math compiler flag is most certainly the chosen solution for such a low-level problem, especially if the alternative approaches violate C++ coding rules. I don't think that any modern game developer will spend a lot of time optimizing a square root calculation. He will probably rely on the functionality of the math library he uses, where math experts have already pulled off every possible trick. Only if a benchmark shows that the square root is the reason for a major FPS hit might he get interested in alternative formulations. But in that case, he is probably looking for ways to remove the square root entirely.

Also, what your post is missing to make it interesting for many people here are hard numbers. You give some advantages and disadvantages, but that doesn't tell much. Pick some specific examples, benchmark them, and compare the resulting floating-point error. This is what most people here are interested in (if they are at all) and what they can work with. It will enable them to decide whether one of the presented methods is a true alternative for them. Otherwise, they are just some formulas that aren't written in a standard-compliant C++ way. By the way, this explains why you got the answers you got.

 

11 hours ago, gustavo rincones said:

and since the creation of this publication, nobody has placed any sample code.

Don't get me wrong, I like this low-level stuff, but I think there are just a few people here who can actually provide code for this problem and are also interested in this low-level topic. Most people can only comment on the presented code, since they have never tackled this specific problem because it was never necessary. Computer games are such complex beasts that you avoid spending time on such topics unless you really have to address them. Since there is the standard solution, which is pretty damned fast, why should one even bother, except out of pure curiosity? Just take a look here:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_rsqrt_ps&expand=4803

The reciprocal square root for vector registers finishes in just 4 cycles on a Skylake processor, the same latency as a single addition or multiplication. Why should a game developer even try to optimize such a fast operation unless he is making a game about calculating square roots?
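For reference, a minimal sketch of using that instruction on four values at once (`FourInvSqrt` is a made-up helper name, not from the thread):

```cpp
#include <x86intrin.h>

// Approximate 1/sqrt for four floats with a single instruction.
// Per Intel's guide, the relative error is at most about 1.5 * 2^-12.
void FourInvSqrt(const float in[4], float out[4])
{
    __m128 v = _mm_loadu_ps(in);         // load four packed floats
    _mm_storeu_ps(out, _mm_rsqrt_ps(v)); // four approximate rsqrts at once
}
```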

 

Long story short:

The problem you are addressing is not really a concern for today's game developers. It is more of an interesting scientific problem. Hence, the chances that somebody here can really discuss it with you in the way you want are rather low.

 

Greetings and good luck

 

EDIT:

An additional piece of information:

http://www.tommesani.com/index.php/component/content/article/2-simd/61-sse-reciprocal.html

I just had a short look, but it seems that the SSE reciprocal square root is hardware accelerated ("These instructions are implemented via hardware lookup tables"). So you probably have no real chance of getting much faster than that. Only a better accuracy-to-performance ratio might be a reason to pick an alternative function.
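If the ~12-bit accuracy of the table-based estimate is not enough, the usual trick (my sketch, not from the linked article) is to append one Newton-Raphson iteration, trading a few extra multiplies for roughly full float precision:

```cpp
#include <x86intrin.h>

// Hardware rsqrt estimate refined by one Newton-Raphson step:
// y1 = y0 * (1.5 - 0.5 * x * y0 * y0), which roughly doubles the
// number of accurate bits (from ~12 to ~23).
float InvSqrtRefined(float x)
{
    __m128 y = _mm_rsqrt_ss(_mm_set_ss(x));
    __m128 halfx = _mm_set_ss(0.5f * x);
    __m128 three_halves = _mm_set_ss(1.5f);
    y = _mm_mul_ss(y, _mm_sub_ss(three_halves,
                                 _mm_mul_ss(halfx, _mm_mul_ss(y, y))));
    return _mm_cvtss_f32(y);
}
```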

 

EDIT2:

You can compare performance and accuracy of your implementations against this function:


#include <x86intrin.h>

float FastInvSqrt(float val)
{
    __m128 reg = _mm_set1_ps(val);
    return _mm_cvtss_f32(_mm_rsqrt_ss(reg));
}

 

It uses the hardware-accelerated vector register function to calculate the reciprocal square root. I haven't benchmarked it myself (maybe I'll do it later), but I think you will see that there is not really a point in implementing your own square root (at least for the reciprocal) as a game developer. If you need accuracy, you will use the standard `sqrt`. If you need raw speed, you will probably use the hardware-accelerated version, unless it is not available on your processor.

 

 

 

For @DerTroll: I must admit that it is a good answer to a good question. I really liked your answer.

On the other hand, I only use the square root when I normalize vectors and compute magnitudes, and of course I avoid the square root whenever I can, for example in the overlap test between two circles. One thing is true: there are already enough problems optimizing MLCP solvers in physics engines without also inventing algorithms to calculate the square root! I think that's a good reason not to touch the low-level issue of compiler tricks.
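The circle-overlap example above can be sketched like this, comparing squared quantities so no square root is needed at all:

```cpp
// Overlap test for two circles without a square root: compare the
// squared distance between centers against the squared sum of radii.
// Squaring both sides preserves the ordering since both are non-negative.
bool CirclesOverlap(float x0, float y0, float r0,
                    float x1, float y1, float r1)
{
    float dx = x1 - x0;
    float dy = y1 - y0;
    float r = r0 + r1;
    return dx * dx + dy * dy <= r * r;
}
```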

 

 

On 11/7/2019 at 1:07 PM, gustavo rincones said:

Hi guys,

Since I created this post, I've only heard things I already knew, such as undefined behavior, static members, trap representations in C++, and modern compilers. I just want to see more efficient ways to calculate the square root. And since the creation of this post, nobody has posted any sample code. Bye.

Have you benchmarked these approximate functions vs the actual hardware sqrt/rsqrt instructions on modern CPUs?

I was under the impression that these tricks were amazing in the 90's, but aren't that useful any more?

Okay, I hate to say it, since you probably put a lot of effort into research and testing, but it seems like @Hodgman is right and there is absolutely no reason to use the functions you presented. I benchmarked them using google benchmark. Here is the code so you can test it yourself.

 


#include <benchmark/benchmark.h>
#include <cmath>
#include <x86intrin.h>



// Functions ----------------------------------------------------------------------------------------------------------

float FastInvSqrt(float val)
{
    __m128 reg = _mm_set1_ps(val);
    return _mm_cvtss_f32(_mm_rsqrt_ss(reg));
}



float FastSqrt(float val)
{
    __m128 reg = _mm_set1_ps(val);
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_rsqrt_ss(reg)));
}



float Sqrt1(const float& n)
{
    static union {
        int i;
        float f;
    } u;
    u.i = 0x5F375A86 - (*(int*)&n >> 1);
    return (int(3) - n * u.f * u.f) * n * u.f * 0.5f;
}



float Sqrt2(const float& n)
{
    union {
        int i;
        float f;
    } u;
    u.i = 0x1FB5AD00 + (*(int*)&n >> 1);
    u.f = n / u.f + u.f;
    return n / u.f + u.f * 0.25f;
}



float Sqrt3(const float& n)
{
    static union {
        int i;
        float f;
    } u;
    u.i = 0x2035AD0C + (*(int*)&n >> 1);
    return n / u.f + u.f * 0.25f;
}



// Actual Benchmarks --------------------------------------------------------------------------------------------------

static void SquareRoot(benchmark::State& state)
{
    float a = 100000000.0;
    for (auto _ : state)
        benchmark::DoNotOptimize(a = std::sqrt(a));
}
BENCHMARK(SquareRoot);



static void ReciprocalSquareRoot(benchmark::State& state)
{
    float a = 100000000.0;
    for (auto _ : state)
        benchmark::DoNotOptimize(a = 1. / std::sqrt(a));
}
BENCHMARK(ReciprocalSquareRoot);



static void SquareRootSSE(benchmark::State& state)
{
    float a = 100000000.0;
    for (auto _ : state)
        benchmark::DoNotOptimize(a = FastSqrt(a));
}
BENCHMARK(SquareRootSSE);



static void ReciprocalSquareRootSSE(benchmark::State& state)
{
    float a = 100000000.0;
    for (auto _ : state)
        benchmark::DoNotOptimize(a = FastInvSqrt(a));
}
BENCHMARK(ReciprocalSquareRootSSE);



static void CustomSquareRoot1(benchmark::State& state)
{
    float a = 100000000.0;
    for (auto _ : state)
        benchmark::DoNotOptimize(a = Sqrt1(a));
}
BENCHMARK(CustomSquareRoot1);



static void CustomSquareRoot2(benchmark::State& state)
{
    float a = 100000000.0;
    for (auto _ : state)
        benchmark::DoNotOptimize(a = Sqrt2(a));
}
BENCHMARK(CustomSquareRoot2);



static void CustomSquareRoot3(benchmark::State& state)
{
    float a = 100000000.0;
    for (auto _ : state)
        benchmark::DoNotOptimize(a = Sqrt3(a));
}
BENCHMARK(CustomSquareRoot3);

BENCHMARK_MAIN();

If you are not familiar with google benchmark (documentation): the command `benchmark::DoNotOptimize` doesn't prevent compiler optimizations on the benchmarked expression itself; it just forces a 'flush' to memory (see the linked documentation). Benchmarking these low-level functions is not as easy as one might think, since modern compilers optimize quite aggressively to achieve maximum performance. However, let's get to the results.

My Processor: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz

GCC release build with -O3 gives:


---------------------------------------------------------------
Benchmark                        Time           CPU Iterations
---------------------------------------------------------------
SquareRoot                       5 ns          5 ns  145729163
ReciprocalSquareRoot             8 ns          8 ns   91433916
SquareRootSSE                    3 ns          3 ns  207794538
ReciprocalSquareRootSSE          2 ns          2 ns  298096066
CustomSquareRoot1                8 ns          8 ns   84971507
CustomSquareRoot2               11 ns         11 ns   64788053
CustomSquareRoot3                7 ns          7 ns   98092852

 

Clang release build with -O3 gives:


---------------------------------------------------------------
Benchmark                        Time           CPU Iterations
---------------------------------------------------------------
SquareRoot                       5 ns          5 ns  113007689
ReciprocalSquareRoot             7 ns          7 ns   93402035
SquareRootSSE                    3 ns          3 ns  207083227
ReciprocalSquareRootSSE          2 ns          2 ns  295002686
CustomSquareRoot1                8 ns          8 ns   83792728
CustomSquareRoot2               10 ns         10 ns   70934801
CustomSquareRoot3                6 ns          6 ns  117465124

 

`SquareRoot` is just `std::sqrt` and `ReciprocalSquareRoot` is `1.f / std::sqrt`. `ReciprocalSquareRootSSE` is the fast vectorized version I showed you in my previous post. `SquareRootSSE` uses an approximation function to calculate the reciprocal of the reciprocal square root, since there is no approximation function for the square root itself (at least none that I know of). Just using `1.f / ReciprocalSquareRootSSE` produced slower results than the accurate square root. Using two approximations might increase the error further; however, in my test, the errors partially canceled each other. But I guess that was just a lucky test case.

The other functions are the ones from your starting post. As you can see, all of your proposed functions are slower than `std::sqrt`. Only the last one is close to `std::sqrt`, but with lower accuracy. The fast SSE versions are both faster than the accurate versions, but the error is not exactly small either:

Results for sqrt(5) ---> accurate: 2.23607 - fast: 2.23633.

Results for 1.f / sqrt(5) ---> accurate: 0.447214 - fast: 0.447144.

I didn't test the errors of your functions, since they already failed the benchmarks. Even though the gain of the SSE square root doesn't seem that high here, it might get much faster if you calculate multiple square roots at once, since `_mm_sqrt_ps` has a reciprocal throughput of 4 cycles while the approximations have a throughput of 1. In that case, some of your functions might get a little speed boost compared to the accurate version, but looking at the dependencies of the involved operations, I doubt it will be much.
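For completeness, calculating four square roots at once looks like this (a sketch with a made-up helper name, not benchmarked here):

```cpp
#include <x86intrin.h>

// Four fully accurate (IEEE correctly rounded) square roots with a
// single instruction. Amortized over four values, the per-sqrt cost
// drops considerably compared to scalar calls.
void FourSqrt(const float in[4], float out[4])
{
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_sqrt_ps(v));
}
```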

I think the problem is that modern processors have hardware-optimized functionality to calculate square roots. Unless you find a brand-new, extremely simple formula to calculate the square root, you have no chance of beating hardware optimization. So there is no point in programming functions that are already provided by your hardware.

Greetings

 


I tested around 10 methods a decade ago with GCC on an Athlon XP (-march=pentium3, -O3), running a few million normalizations as a payload, and the fastest way to calculate the square root was simply using the sqrtf function (which uses the CPU's built-in square root instruction). It was 5-80x faster than any of the other methods, as this instruction has been properly implemented for ages. I have never tested it on ARM, so I have no clue about that architecture.

Microbenchmarks and micro-optimizations do have their place.

How many places in your code are square roots even performed? Do they ever become a blip on the profiler?

Some pieces of physics code and graphics code are bottlenecks, and some of that code needs this kind of thing. But performance issues in 2019 are normally due to data management, caches, and broad algorithm choices.

@DerTroll you probably shouldn't copy the "const &" on the floats from the original post, which is a clear pessimization. Although I doubt it will be significant enough to make them faster than the hardware variant, especially since those can be made even faster, as most compilers issue an (unnecessary in this case) shuffle before the rsqrt.

https://godbolt.org/z/ZMNzb5

3 hours ago, l0calh05t said:

@DerTroll you probably shouldn't copy the "const &" on the floats from the original post which is a clear pessimization.

I hadn't seen that. I just copied the functions without paying much attention to the implementation details. I changed it and reran the benchmarks. No effect for GCC; Clang actually got even slower. No clue why. However, I think modern compilers are quite good at optimizing by-value/by-reference function signatures. But that's just a guess. I haven't tested it, but in my experience, every "obvious" optimization is already performed by the compiler.

 

4 hours ago, l0calh05t said:

Although I doubt it will be significant enough to make it faster than the hardware variant, especially since those can be made even faster, as most compilers issue an (unnecessary in this case) shuffle before the rsqrt.

https://godbolt.org/z/ZMNzb5

Well, ICC and Clang are smart enough to optimize the `_mm_set1_ps` away. I also tried using `_mm_set_ss`, but it didn't change much, except that GCC replaced the shuffle with an insert. However, if you compare the timings of GCC and Clang, you see no real effect on the performance. That's a bit confusing to me, since 5 cycles (latency of rsqrt + shuffle) vs. 4 cycles should have a clear impact (25%), but I guess the compilers inline the function in the benchmarks and might perform other optimizations in this context. I don't really know, since microbenchmarking such low-level stuff is always a fight against the aggressive compiler xD.
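For reference, the pass-by-value `_mm_set_ss` variant discussed here looks like this (`FastInvSqrtSS` is just an illustrative name):

```cpp
#include <x86intrin.h>

// Only the low lane is initialized, so no broadcast/shuffle is
// needed before the scalar rsqrt (unlike _mm_set1_ps).
float FastInvSqrtSS(float val)
{
    __m128 reg = _mm_set_ss(val);
    return _mm_cvtss_f32(_mm_rsqrt_ss(reg));
}
```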

Greetings

This topic is closed to new replies.
