bits as float - "MOVD xmm0, eax" and MSVC nonsense

13 comments, last by NikiTo 4 years ago

While checking out something'n'other via compiler explorer i noticed that MSVC does

mov [temp], eax
movss xmm0, [temp]

… instead of just …

movd xmm0, eax

Won't really matter performance wise for me, but still - is there some way to stop MSVC from being an absolute idiot and make it do the sane thing like every other darn compiler i can find? _castu32_f32 seems to be a perfect fit - but is not available in MSVC. Is there something else that could work?

--------------------------------------------

Thought about it a bit more before pressing "post" … and got an idea to solve it. Hm, why not include it and post it anyway - what do you think of it?

#include <immintrin.h>  // _mm_loadu_si32, _mm_castsi128_ps, _mm_store_ss

float _castu32_f32(unsigned int val) {
    float res;
    _mm_store_ss(&res, _mm_castsi128_ps(_mm_loadu_si32(&val)));
    return res;
}

1. _mm_loadu_si32 - move int into the beginning of a m128i (first 32 bits in m128 are what floats internally are - let's face it, the ancient FPU floats are so horrible that no-one uses them)

2. _mm_castsi128_ps - does nothing (just switches the type sugar coating for the compiler)

3. _mm_store_ss - does nothing (move m128 to float - where the latter is internally actually also m128 anyway)

Works perfectly in release, but is terrible (of course) when optimizations are not enabled. Ignoring all the function call code and its extensive checking - this gem remains:

movd        xmm0,dword ptr [val]        // do the thing (_mm_loadu_si32)
movdqa      xmmword ptr [rsp+50h],xmm0  // store it in preparation of doing nothing
movaps      xmm0,xmmword ptr [rsp+50h]  // load to do nothing (_mm_castsi128_ps)
movaps      xmmword ptr [rsp+60h],xmm0  // store it in preparation of doing nothing
movaps      xmm0,xmmword ptr [rsp+60h]  // load to do nothing (_mm_store_ss)
movss       dword ptr [res],xmm0        // store it in result after doing nothing

Amusing. Also, shows/confirms what the compiler sees.
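
For comparison, here is a minimal sketch of the same idea that skips the memory operand in the source entirely. It uses _mm_cvtsi32_si128, the intrinsic that Intel documents as corresponding to "movd xmm, r32", followed by _mm_cvtss_f32 to read the low lane back as a float. The function name bits_to_float_sse is made up for illustration, and whether a given MSVC version actually emits a single movd for it is an assumption that would need checking in compiler explorer.

```cpp
#include <immintrin.h>  // _mm_cvtsi32_si128, _mm_castsi128_ps, _mm_cvtss_f32
#include <cstdint>

// Hypothetical helper name; not something MSVC or Intel ships.
inline float bits_to_float_sse(std::uint32_t bits) {
    // Put the integer into the low 32 bits of an XMM value, upper bits zeroed.
    __m128i as_int = _mm_cvtsi32_si128(static_cast<int>(bits));
    // Pure type reinterpretation for the compiler; no instruction of its own.
    __m128 as_ps = _mm_castsi128_ps(as_int);
    // Return the low lane as a scalar float.
    return _mm_cvtss_f32(as_ps);
}
```

Usage would look like bits_to_float_sse(0x3F800000u), which yields 1.0f because 0x3F800000 is the IEEE 754 single-precision bit pattern for 1.0.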


You are brave to read the ASM output of the compiler and post it on the internet. Some people worship compilers as gods. They will feast on your flesh and not even bones will be left.

tanzanite7 said:
Won't really matter performance wise for me,

If i were you, i would just ignore it and keep coding. Too much work to teach a horse to fly like a bird.

tanzanite7 said:
Won't really matter performance wise for me, but still - is there some way to stop MSVC from being an absolute idiot and do the sane thing like every other darn compiler i can find? _castu32_f32 seems to be perfect fit - but is not available in MSVC. Is there something else that could work?

You should file a bug-report/feature-request with Microsoft. Show them your code and the compiler-explorer output, as well as the asm generated by other compilers, and maybe have them fix/improve this situation.

NikiTo said:
You are brave to read the ASM output of the compiler and post it on the internet. Some people worship compilers as gods. They will feast on your flesh and not even bones will be left.

Ironically, the people who praise compilers over handwritten ASM are the same people that look at the generated ASM in tools like compiler explorers, and realise that either the result is already pretty optimal, or can be improved by the compiler-writers. At the end of the day, compilers can produce the same/better assembly than you. Even if they won't today for a given code, there is no reason why the same procedures you use for turning source into ASM can not be implemented as an optimization-step in the compiler. I hope that one day you realise that.

Some surprising results have emerged. Decided to add the _castu32_f32 implementation to my project and used it in random float number generation. For debug purposes also added an alternate implementation for debug build (to keep my massive use of random floating point values to sane speeds when debugging) via preprocessor directives. Which led me to test it with real code in release build.

Using “vec3 Rng::onSphere(float radius)” for benchmarking - which returns an unbiased random point on given sphere. Including the overhead for fully using the returned vector (to prevent optimizations from cutting anything out) and looping it 10 million times - “(*(float*)&val)” ends up making the whole thing ~8% slower than using my implementation of “_castu32_f32(val)” (both cases in release build with full optimizations of course).

Did not see that coming. Still won't matter performance wise to me … but interesting. That unnecessary memory temporary has surprisingly noticeable impact.
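
Side note on the “(*(float*)&val)” variant benchmarked above: it reads an integer object through a float pointer, which standard C++ treats as an aliasing violation. A minimal sketch of the usual well-defined alternative is a std::memcpy between two same-sized objects (C++20 also offers std::bit_cast<float>(bits) from <bit>). The name bits_to_float_memcpy is hypothetical, and nothing here is a claim about what MSVC emits for it or how it would fare in the benchmark above.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper name, for illustration only.
inline float bits_to_float_memcpy(std::uint32_t bits) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    float result;
    // Copy the object representation; compilers commonly reduce a
    // fixed-size memcpy like this to a plain register or memory move.
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```

It could be dropped in wherever the pointer cast is used, e.g. bits_to_float_memcpy(val), and measured the same way as the other two variants.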

NikiTo said:

If i were you, i would just ignore it and keep coding. Too much work to teach a horse to fly like a bird.

Yeah, a bit of a detour down the rabbit hole. I don't even quite remember how i ended up there.

Juliean said:

You should file a bug-report/feature-request with Microsoft. Show them your code and the compiler-explorer output, as well as the asm generated by other compilers, and maybe have them fix/improve this situation.

NikiTo said:
You are brave to read the ASM output of the compiler and post it on the internet. Some people worship compilers as gods. They will feast on your flesh and not even bones will be left.

Ironically, the people who praise compilers over handwritten ASM are the same people that look at the generated ASM in tools like compiler explorers, and realise that either the result is already pretty optimal, or can be improved by the compiler-writers. At the end of the day, compilers can produce the same/better assembly than you. Even if they won't today for a given code, there is no reason why the same procedures you use for turning source into ASM can not be implemented as an optimization-step in the compiler. I hope that one day you realise that.

The forum's posting interface silently and completely broke - as usual. Since that is kind of usual, i happened to copy my reply out before posting and can now re-add the part that silently evaporated. Won't edit the reply itself, as the mess that shows up when i try to edit tells me to stay far away from attempting that.

So, the missing part where i replied about reporting it:

That takes a bit of doing, but i have done that quite a few times before and got stuff fixed as a result - so, i will think about it.

Juliean said:
Ironically, the people who praise compilers over handwritten ASM are the same people that look at the generated ASM in tools like compiler explorers, and realise that either the result is already pretty optimal, or can be improved by the compiler-writers. At the end of the day, compilers can produce the same/better assembly than you. Even if they won't today for a given code, there is no reason why the same procedures you use for turning source into ASM can not be implemented as an optimization-step in the compiler. I hope that one day you realise that.

I know you don't accept quotes from other people, only what you believe matters, but here it goes anyway:

https://stackoverflow.com/questions/577554/when-is-assembly-faster-than-c

Definitely nobody in the top 5 answers said “compilers are ALWAYS faster than ASM”. Definitely not what you say. Nothing at all like the things you claim.

Why don't you learn to code in ASM? Maybe you will end up loving ASM if you understand it. I don't suggest you use it, because it is not as productive as using compilers. Just understand it before you blame it.

@NikiTo: Why can't you be more relaxed? Nothing of what you put forward was said by @Juliean. They neither said nor even remotely meant that compilers are always faster than hand-written assembler, nor did they claim anything of the sort. The statement that “in the end a compiler can produce better/faster code than you” is a completely valid assertion (note the "can"). And, btw, it is commonplace. I must also say, stackexchange is by no means a high-level or even trustworthy source; there is so much bunk, e.g. in the geoscience or history departments, that it is ridiculous.

I have the feeling that your fox/grape metaphor shows signs of psychological projection. To convince me that you're so high above us, why don't you show off one of your assembly creations so we can let it run against a high level version. But please something genuine, not just a Fibonacci sequence …

@Green_Baron Because two weeks ago, Juliean and another 3k reputation member bullied me out of this forum. I know what i am talking about.

(ignoring my quotes a 3rd time, baron?…. Only you and the high reputation members are correct? Eh?

Here goes another link anyway:

https://www.kaspersky.com/blog/the-history-of-programming/1356/

Let me guess, baron…. The people from Kaspersky, Linus Torvalds and the whole Stackexchange are nothing compared to your own opinion and the opinion of the high reputation members you try to seduce. A peculiar personality you have there…)

(I am not going to reveal my real life portfolio and my real identity here only to show you some of my ASM code that you are unable to read anyway. What will be your next dirty trick, baron? - to tell me (to lie to me): “your code doesn't compile on my computer, so i can not measure it myself so i don't believe you so you are wrong and i am correct”… )

My position was always - ASM is faster, smaller but less productive.
If you still debate that claim, i will ignore you, because your level of programming skills and understanding are too low compared to mine. No point in even talking to you.

I can predict what your next comment will be like, saying something about my grammar maybe. But you and I, we both know this is a public place and a lot of other random people will read this and see through the real you.

This topic is closed to new replies.
