
Best practice for CPU detection / dynamic dispatch for SSE


Started by lawnjelly December 09, 2019 03:58 PM

7 comments, last by Adam_42 4 years, 5 months ago

lawnjelly

2,021

Author

December 09, 2019 03:58 PM

I've just been writing a small library to add SIMD versions of some common functionality to Godot:

https://github.com/lawnjelly/godot-simd

I'm now detecting the CPU caps for x86 / x86_64, so that I can choose a different codepath at runtime. The problem I am facing is that GCC is complaining, e.g.

/usr/lib/gcc/x86_64-linux-gnu/5/include/pmmintrin.h:56: error: inlining failed in call to always_inline '__m128 _mm_hadd_ps(__m128, __m128)': target specific option mismatch

This happens when I try to use an SSE function from a level above the one the compilation flags allow (in this case an SSE3 instruction). I can *solve* this by adding e.g. -mavx to the compilation flags, however I am worried that this will allow e.g. AVX instructions to be used throughout, which I don't want because it will crash older CPUs.

I never had this problem before (I don't remember doing anything special to get around it, and I was able to run successfully on both older and newer CPUs). I would have expected the sensible compiler behaviour to be to follow the flags unless intrinsics were explicitly used (in which case allow the intrinsic instructions). Apparently not though.

Anyway I found a mention of this in the gcc docs:

https://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/i386-and-x86_002d64-Options.html

-mmmx
-mno-mmx
-msse
-mno-sse
-msse2
These switches enable or disable the use of instructions in the MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, AVX, AES, PCLMUL, SSE4A, FMA4, XOP, LWP, ABM or 3DNow! extended instruction sets. These extensions are also available as built-in functions: see X86 Built-in Functions, for details of the functions enabled and disabled by these switches.
To have SSE/SSE2 instructions generated automatically from floating-point code (as opposed to 387 instructions), see -mfpmath=sse.
GCC depresses SSEx instructions when -mavx is used. Instead, it generates new AVX instructions or AVX equivalence for all SSEx instructions when needed.
These options will enable GCC to use these extended instructions in generated code, even without -mfpmath=sse. Applications which perform runtime CPU detection must compile separate files for each supported architecture, using the appropriate flags. In particular, the file containing the CPU detection code should be compiled without these options.

They seem to recommend compiling a different file for each SSE version, and altering the flags each time to limit the architecture to that file only.

My question is: is this a good approach, and is it cross-platform? I need to get things compiling with clang and Visual Studio in addition to GCC.

The docs are a bit unclear on this. Does -mavx make the compiler generate its own AVX code, or does it just allow you to use AVX? Or is that what -march does? If so, can I simply combine -mavx with an -march setting that limits compiler-generated code to SSE2, while still allowing AVX through intrinsics?
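One option the post does not mention, sketched here as an assumption rather than as the thread's conclusion: GCC (4.9 and later) and clang accept a per-function `target` attribute, so a single function can use SSE3 intrinsics while the rest of the file keeps the baseline flags. The names `hsum_sse3`, `hsum_scalar` and `hsum` are hypothetical, and MSVC has no equivalent attribute, so this only covers the GCC/clang side.

```cpp
#include <cassert>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1

// Only this function is compiled with SSE3 enabled; the rest of the
// translation unit keeps whatever flags the build passed in.
__attribute__((target("sse3")))
static float hsum_sse3(const float *v) {
    __m128 x = _mm_loadu_ps(v);
    x = _mm_hadd_ps(x, x);  // (v0+v1, v2+v3, v0+v1, v2+v3)
    x = _mm_hadd_ps(x, x);  // every lane now holds v0+v1+v2+v3
    return _mm_cvtss_f32(x);
}
#endif

// Portable fallback, used when SSE3 is missing or on other compilers.
static float hsum_scalar(const float *v) {
    return (v[0] + v[1]) + (v[2] + v[3]);
}

// Runtime dispatch: check the CPU before touching the SSE3 path.
float hsum(const float *v) {
    assert(v != nullptr);
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("sse3")) {
        return hsum_sse3(v);
    }
#endif
    return hsum_scalar(v);
}
```

The caller only ever sees `hsum`. Because the attribute is scoped to one function, the compiler should not emit SSE3 in the surrounding baseline code.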

lawnjelly

2,021

Author

December 09, 2019 04:48 PM

Well to somewhat help I've found this blog post which is instructive:

https://randomascii.wordpress.com/2016/12/05/vc-archavx-option-unsafe-at-any-speed/

It seems the Google Chrome team had a similar problem when trying to add AVX optionally: they were compiling certain files with a different arch flag, and this caused major problems when the AVX file pulled in helper functions from other libraries (in this case math.h). That's a serious gotcha to watch out for...

But it does sound like isolating in different cpps (or more precisely translation units) is the way to go.
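A rough sketch of how the one-file-per-instruction-set layout might be wired together. Everything here is hypothetical (`dot4_sse2`, `dot4_avx`, `select_dot4`, `detect_avx`, and the file names in the comments). The kernels are plain C++ stand-ins so the block builds as a single file, whereas in a real build each would sit in its own translation unit with its own flags.

```cpp
#include <cassert>

using Dot4Fn = float (*)(const float *, const float *);

// Would live in dot4_sse2.cpp, built with the project's baseline flags.
float dot4_sse2(const float *a, const float *b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Would live in dot4_avx.cpp, the only file built with -mavx (or /arch:AVX).
// To sidestep the gotcha from the blog post, that file should avoid pulling
// in shared inline helpers that other translation units also instantiate.
float dot4_avx(const float *a, const float *b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Would live in dispatch.cpp, built with baseline flags only.
Dot4Fn select_dot4(bool cpu_has_avx) {
    return cpu_has_avx ? dot4_avx : dot4_sse2;
}

// Detection shown for GCC/clang on x86; other targets report "no AVX" here.
bool detect_avx() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx");
#else
    return false;
#endif
}

// Public entry point: the kernel is chosen once, on first use.
float dot4(const float *a, const float *b) {
    static const Dot4Fn fn = select_dot4(detect_avx());
    return fn(a, b);
}
```

Resolving the pointer once keeps the detection cost out of the hot path, and the dispatch file itself never needs the higher flags.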

frob

46,227

December 09, 2019 10:46 PM

lawnjelly said:
My question is: is this a good approach, and is it cross-platform? I need to get things compiling with clang and Visual Studio in addition to GCC.

Compiler options are not cross platform.

I'd suggest that it is not a good approach unless you actually have that old hardware and the various platforms hanging around. Companies are willing to support many different builds, but individuals generally can't afford it. If you have some linux boxes, various PC boxes, and various macs lying around then go for cross platform all you want.

Figure out what your target machine is. As a starting point, it's the one you use daily, and others you have access to. Make that your minimum.

AVX has been in mainstream x86 chips since 2011. If this were 2012, or even 2014, I'd recommend against it because it would be too new. At nearly nine years old it's quite easy to require it. Similarly, build your code in 64-bit mode and it automatically includes MMX, SSE and SSE2. (SSE3, SSSE3 and AVX are not part of the x86-64 baseline, so they still need either a runtime check or a raised minimum spec.)

On modern Visual Studio, the /arch options for 64-bit builds are AVX (the lowest one you can select), AVX2, and AVX512.

Shaarigan

1,471

December 10, 2019 07:50 AM

And if your code is written in a way that makes optimizations easy for the compiler to recognize, it is also very likely that it will be converted into SSE instructions. I have tested this with a matrix multiplication: unrolling the loop by hand made it possible for LLVM/clang to optimize it into SSE, and gave a performance boost of 1.5 million 4x4 matrix multiplications per second.

lawnjelly

2,021

Author

December 10, 2019 09:05 AM

Shaarigan said:
And if your code is written in a way that makes optimizations easy for the compiler to recognize, it is also very likely that it will be converted into SSE instructions. I have tested this with a matrix multiplication: unrolling the loop by hand made it possible for LLVM/clang to optimize it into SSE, and gave a performance boost of 1.5 million 4x4 matrix multiplications per second.

Yes, so far we have been relying on autovectorization. I have confirmed it works with simple operations (with -O3), especially when you lay out your data in a SIMD-friendly manner. If you have a look, the first test version of the library I posted has no intrinsics.
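To illustrate what a SIMD-friendly layout can look like, here is a minimal sketch assuming a structure-of-arrays arrangement; `Vec3SoA` and `lengths_squared` are hypothetical names, not taken from the library. Whether a given compiler actually vectorizes the loop is not guaranteed. GCC's -fopt-info-vec and clang's -Rpass=loop-vectorize report what was vectorized.

```cpp
#include <cstddef>
#include <vector>

// Structure of arrays: each component is contiguous in memory, so one
// vector load picks up the same component of several consecutive points.
struct Vec3SoA {
    std::vector<float> x, y, z;
};

// Squared length of every vector. The loop body is branch-free and walks
// all arrays with unit stride, which is the shape autovectorizers handle best.
std::vector<float> lengths_squared(const Vec3SoA &v) {
    const std::size_t n = v.x.size();
    std::vector<float> out(n);
    const float *x = v.x.data();
    const float *y = v.y.data();
    const float *z = v.z.data();
    float *o = out.data();
    for (std::size_t i = 0; i < n; ++i) {
        o[i] = x[i] * x[i] + y[i] * y[i] + z[i] * z[i];
    }
    return out;
}
```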

There are a few limitations to autovectorization though:

- I can't get it to work with certain operations, like reciprocal square root.
- If you want to use it for anything above your base level of SSE support, you have to use workarounds like the separate translation unit with a different -march setting, as I mentioned above.
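For the reciprocal square root case, this is roughly what the explicit-intrinsic version might look like. It is a sketch, not the library's code: `rsqrt4` is a hypothetical name, and the Newton-Raphson refinement step is an addition to recover precision from the approximate `_mm_rsqrt_ps` result. SSE is part of the x86_64 baseline, so no extra flags should be needed there.

```cpp
#include <cassert>
#include <cmath>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#endif

// Computes out[i] = 1 / sqrt(in[i]) for four floats.
void rsqrt4(const float *in, float *out) {
    assert(in != nullptr && out != nullptr);
#ifdef HAVE_SSE
    const __m128 x = _mm_loadu_ps(in);
    const __m128 y = _mm_rsqrt_ps(x);  // fast approximation, roughly 12 bits
    // One Newton-Raphson step: y' = y * (1.5 - 0.5 * x * y * y)
    const __m128 half_x = _mm_mul_ps(_mm_set1_ps(0.5f), x);
    const __m128 yy = _mm_mul_ps(y, y);
    const __m128 refined =
        _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f), _mm_mul_ps(half_x, yy)));
    _mm_storeu_ps(out, refined);
#else
    // Scalar fallback for targets without SSE.
    for (int i = 0; i < 4; ++i) {
        out[i] = 1.0f / std::sqrt(in[i]);
    }
#endif
}
```

The result is still not bit-identical to `1.0f / std::sqrt(x)`, so it only suits code that can tolerate a small relative error.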

In practice we already have autovectorization throughout for SSE2 on x86_64, I don't know if there's any for x86 because I couldn't confirm our min spec the other day when I asked. It should be nothing, but I wouldn't be surprised if one or two third party libraries assume SSE2.

frob said:
Compiler options are not cross platform. I'd suggest that it is not a good approach unless you actually have that old hardware and the various platforms hanging around. Companies are willing to support many different builds, but individuals generally can't afford it. If you have some linux boxes, various PC boxes, and various macs lying around then go for cross platform all you want.

Oh sure, I might have to use slightly different switches on different platforms, but I was referring to the general approach of having a different architecture setting (e.g. -mavx, or /arch:AVX on Visual Studio) for particular source files, and whether any of you guys had used this approach. It seems to have worked for Google Chrome anyway.

I've done CPU detection and dynamic dispatch for different levels of SSE before, on just x86 with the Microsoft compiler. I think it must have just allowed you to use intrinsics that were above the level you were compiling for (can't confirm this offhand), whereas GCC is complaining.

I have previously used an old machine with limited SSE support to test this kind of thing, and I also have various Android devices for testing Neon. I wouldn't support Apple myself; I'd let someone else do the testing for that. You could probably even scan the executable for opcodes that use the wrong version of SSE to test this kind of thing rigorously (not sure if there are any tools that do this).

Worst case, even without detection and dynamic dispatch, we would know at compile time that x86_64 supports SSE2, so that is usable throughout, and on Armv8 Neon is mandated, so it could safely be used.
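The compile-time fallback could be expressed with the compilers' predefined macros, roughly as below. This is a sketch: `compile_time_simd_baseline` is a hypothetical helper, and the macro list only covers GCC, clang and MSVC as far as I am aware.

```cpp
// Reports the SIMD level the compiler is allowed to assume for this
// translation unit, based purely on predefined macros (no runtime check).
constexpr const char *compile_time_simd_baseline() {
#if defined(__AVX__)
    return "avx";   // built with -mavx or /arch:AVX
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return "sse2";  // always true on x86_64
#elif defined(__ARM_NEON) || defined(__aarch64__) || defined(_M_ARM64)
    return "neon";  // mandated on Armv8 / AArch64
#else
    return "scalar";
#endif
}

// The helper is usable in constant expressions, so it can gate code paths
// at compile time as well.
static_assert(compile_time_simd_baseline()[0] != '\0', "baseline must be named");
```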

frob

46,227

December 10, 2019 04:44 PM

Yes, I'd call AVX part of any modern minimum spec.

That puts you on hardware from 2011 and later, so Intel's second-generation Core i series (Sandy Bridge, the 2xxx model numbers) and later, or AMD's Bulldozer and later. While you can certainly find earlier processors out there, they wouldn't be part of any modern gaming rig.

frob

46,227

December 10, 2019 04:51 PM

Addendum: if you feel strongly about going back to any x86_64 processor, SSE2 is guaranteed there, but SSE3 wasn't on the very earliest x64-capable machines. Relatively few modern programs will run on those; even modern web browsers use newer CPU instructions. Still, that's an option if you're trying to support museum-aged hardware.

Adam_42

3,664

December 13, 2019 05:29 PM

For choosing which instruction sets to use, the Steam Hardware Survey (https://store.steampowered.com/hwsurvey/) is helpful: look in the "other settings" section. For example, it will tell you that about 10% of Steam users don't have AVX-capable CPUs.
