
Handling "float" in in generic memory (blob)


Hello,

so for my bytecode/interpreted language, I have support for primitive types - mostly byte, int and float. For a while I've been treating int and float separately - there are multiple operations, like loading from and storing to a local variable:

case OpCodes::LoadInt:
{
	const auto offset = stream.ReadData<LocalOffset>();
	const auto value = m_stack.GetValue<int>(state.pFrame, offset);

	m_stack.Push(value);
	break;
}
case OpCodes::LoadFloat:
{
	const auto offset = stream.ReadData<LocalOffset>();
	const auto value = m_stack.GetValue<float>(state.pFrame, offset);

	m_stack.Push(value);
	break;
}
case OpCodes::StoreInt:
{
	const auto offset = stream.ReadData<LocalOffset>();
	auto& ref = m_stack.GetRef<int>(state.pFrame, offset);
	ref = m_stack.Pop<int>();

	break;
}
case OpCodes::StoreFloat:
{
	const auto offset = stream.ReadData<LocalOffset>();
	auto& ref = m_stack.GetRef<float>(state.pFrame, offset);
	ref = m_stack.Pop<float>();

	break;
}

The reason I originally did it that way is that I saw a lot of languages do it that way (Java, for example), and I didn't think much of it.

Now that my language is pretty evolved, I'm trying to conserve space in the "OpCodes", so that they can stay 8 bit (I'm currently using 222 out of 255). And that got me thinking - is there actually any benefit to treating "float" explicitly in a situation like the above? I'm thinking about changing the instructions above to "LoadWord" and "StoreWord", which would handle any word-sized variable via int or uint, instead of separate int/float.
I'm just not sure if it's a good idea. I know from testing that in general reinterpret_casting the content of a float to an int works and preserves certain aspects/operations (equality/ordering). But on the other hand, the C++ compiler always generates specific instructions/registers (XMM) for dealing with floating-point types. So is it actually advantageous to always treat float data as "float", or is the generated floating-point assembly just better when dealing with large-scale floating-point work (like a full function of float operations following each other - which is not the case in my bytecode)?
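
Roughly what I have in mind - just a sketch, reusing the helpers from above, with uint32_t standing in for the word type:

case OpCodes::LoadWord:
{
	// Hypothetical merged opcode: moves 32 bits regardless of whether
	// the local holds an int or a float
	const auto offset = stream.ReadData<LocalOffset>();
	const auto value = m_stack.GetValue<uint32_t>(state.pFrame, offset);

	m_stack.Push(value);
	break;
}
case OpCodes::StoreWord:
{
	const auto offset = stream.ReadData<LocalOffset>();
	auto& ref = m_stack.GetRef<uint32_t>(state.pFrame, offset);
	ref = m_stack.Pop<uint32_t>();

	break;
}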

Hope my question/concern makes sense - perhaps somebody who knows a bit more about the inner workings of CPUs and/or the IEEE standards and whatnot can weigh in.


Juliean said:
I'm just not sure if its a good idea. I know from testing that in general reinterpret_casting a the content of a float to an int works and preserves certain aspects/operations (equality/ordering)

It may appear to work, and some shipping code has depended on it (the famous inverse square root from the Quake III codebase, for instance), but it's actually undefined behavior to reinterpret cast an int as a float and vice versa. I believe this is the case even in C, using C-style casts - it's just so common that most major compilers will “probably” let you get away with it in “most” cases (because of the sheer quantity of existing code that would break if this were enforced strictly), but it could cause the optimizer to produce some very strange results, if for example it decided that the UB can't happen and optimized out all of your actual code as a result.

If you're going to be moving memory around as opaque bytes, you'll want to treat it explicitly as raw bytes, and be sure you only cast it back from raw bytes to the type you know it is. That is what the std::byte type (introduced in C++17) is for. In C and in earlier standards, chars serve the same function.

If you're in a situation where you need “type punning”, in general, I encourage you to watch this CppCon talk on this exact subject. The float/int punning case is mentioned explicitly at about 8 minutes in and the solution is to store the int/float by value and memcpy the raw bytes into the float/int whose lifetime has already started. This takes care of the lifetime and alignment problems that casting from raw bytes can cause.
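
A minimal sketch of that memcpy approach, for the float-to-int direction:

#include <cstring>

// Both objects' lifetimes have started before the copy; memcpy just moves
// the object representation between them.
int float_bits(float f)
{
	int i;
	static_assert(sizeof(i) == sizeof(f), "sizes must match");
	std::memcpy(&i, &f, sizeof(i));
	return i; // optimizers typically reduce this to a single register move
}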

Oberon_Command said:
It may appear to work, and some shipping code has depended on it (the famous inverse square root from the Quake III codebase, for instance), but it's actually undefined behavior to reinterpret cast an int as a float and vice versa. I believe this is the case even in C, using C-style casts - it's just so common that most major compilers will “probably” let you get away with it in “most” cases (because of the sheer quantity of existing code that would break if this were enforced strictly), but it could cause the optimizer to produce some very strange results, if for example it decided that the UB can't happen and optimized out all of your actual code as a result.

You're right about that. But I'm also pretty sure that it works on any major compiler - on MSVC by default, and on GCC or Clang as long as you disable strict aliasing (-fno-strict-aliasing). Which I'd pretty much have to do if I ever went with those compilers, as I depend on UB reinterprets in a few places and couldn't change that without some reworks.

Oberon_Command said:
If you're going to be moving memory around as opaque bytes, you'll want to treat it explicitly as raw bytes, and be sure you only cast it back from raw bytes to the type you know it is. That is what the std::byte type (introduced in C++17) is for. In C and in earlier standards, chars serve the same function. If you're in a situation where you need "type punning", in general, I encourage you to watch this CppCon talk on this exact subject. The float/int punning case is mentioned explicitly at about 8 minutes in and the solution is to store the int/float by value and memcpy the raw bytes into the float/int whose lifetime has already started. This takes care of the lifetime and alignment problems that casting from raw bytes can cause.

Ah, right, I forgot about the std::byte type. But I already knew about the memcpy trick, as well as that we now have std::bit_cast in C++20. The only reason I don't use std::bit_cast or memcpy at this point in time is that they introduce a considerable overhead in debug builds, which is not acceptable for my use-case. I was thinking about making a macro that does reinterpret_cast in debug and bit_cast otherwise, though.
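
Something along these lines - an untested sketch, with PUN_CAST as a made-up name:

#include <bit>

// UB-but-cheap reinterpret in debug builds, well-defined std::bit_cast
// (C++20) everywhere else.
#ifdef _DEBUG
	#define PUN_CAST(To, value) (*reinterpret_cast<const To*>(&(value)))
#else
	#define PUN_CAST(To, value) (std::bit_cast<To>(value))
#endif

// usage: const int bits = PUN_CAST(int, someFloat);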

Languages like C/C++ treat those types differently to use the full power of the CPU they're running on. Thanks to the massive impact of the games industry, modern CPUs today have optimized instructions for floating-point arithmetic, so it is worth it for the compiler to handle them differently. MSVC for example gives the option to increase floating-point performance at the cost of precision.

I guess the reason that languages like Java and C# handle integers and floats differently is simply type safety in the first case, and maybe performance improvements in the second. I know that the .NET JIT compiles code into assembly when a C# application is launched, so the same rules apply as for C/C++: performance increases with CPU-specific floating-point instructions.

I don't know much about your language, but if I implemented my first naive idea of a compiled scripting language, I wouldn't make any distinction between integers and floats, as they're both nothing but data. One of the big benefits of C/C++ over C#, in my opinion, is that I can treat memory as whatever I want it to be - a byte array, an integer or a floating-point number - as long as I can pass the address as a pointer.

Shaarigan said:
Languages like C/C++ treat those types differently to use the full power of the CPU they're running on. Thanks to the massive impact of the games industry, modern CPUs today have optimized instructions for floating-point arithmetic, so it is worth it for the compiler to handle them differently. MSVC for example gives the option to increase floating-point performance at the cost of precision.

Yeah, that makes sense - and it seems logical that the compiler would always opt to generate floating-point instructions even if a function does, say, nothing but take a floating-point parameter and return it (even if the result would be the same if we performed the operation via generic mov instructions). It just seems consistent to always deal with "float" using float instructions, when available.

Shaarigan said:
I guess the reason that languages like Java and C# handle integers and floats differently is simply type safety in the first case, and maybe performance improvements in the second. I know that the .NET JIT compiles code into assembly when a C# application is launched, so the same rules apply as for C/C++: performance increases with CPU-specific floating-point instructions.

Ah yeah, it does make sense when we think about JIT. I'm personally not going to deal with JIT in the foreseeable future. I'm already getting crazy good results in some synthetic benchmarks of my new versus old system (something like 32x (!) speedups - not that the new system is so good, but the old one was just really bad), so I'm more focused on getting stuff working again. So I think that's not a reason for me.
Type safety is an argument. I currently don't have many type-checks in place. I was thinking about doing a debug-stack, but the need didn't really arise yet. Most problems with types either appeared immediately (trying to treat an int as a string), or showed up as a stack-underflow.

Shaarigan said:
I don't know much about your language, but if I implemented my first naive idea of a compiled scripting language, I wouldn't make any distinction between integers and floats, as they're both nothing but data. One of the big benefits of C/C++ over C#, in my opinion, is that I can treat memory as whatever I want it to be - a byte array, an integer or a floating-point number - as long as I can pass the address as a pointer.

Yeah, I see it the same way. I mean, I was a bit in over my head when I started, so naturally I just made instructions for the different data-types. Only now do I have the experience to go back and say "wait, those are actually functionally the same". So I was trying to see if there are obvious reasons why you wouldn't want to do it, but I don't see anything tangible - I'll have to keep the issues with UB reinterprets in mind, but other than that I think I'll just merge all the float/int opcodes for now. It should probably even be a net gain in performance, by increasing the cache-hit rate and the locality of reference/instructions.

Juliean said:
Yeah, that makes sense - and it seems logical that the compiler would always opt to generate floating-point instructions even if a function does, say, nothing but take a floating-point parameter and return it (even if the result would be the same if we performed the operation via generic mov instructions).

With, say, C/C++, the compiler can't know that in the general case*. It is generally preferable to have a function calling convention that passes arguments in registers to some extent (e.g. the Microsoft x64 default uses RCX, RDX, R8, R9, and XMM0 to XMM3), and it is also generally preferable to use the correct register type.
I'm pretty sure I have seen compilers use the basic mov instruction on floating-point types when they are going from memory to memory.

The compiler on the calling side only sees the function signature from the header/function-pointer/etc.; it doesn't know if float foo(float a); is going to do arithmetic, or just return. I assume the vendors did the research into more flexible calling conventions and decided it is not worth the pain for a little extra performance (also, a lot of small functions, where the relative calling overhead is high, will get inlined already), beyond the existing __stdcall, __cdecl, __fastcall, __vectorcall, etc.
And of course the compiler for the function itself has to accommodate what the caller will do, so it will use the XMM registers even if just compiling a return a; (although I just realised that float foo(float a) { return a; } might actually be a no-op, since it is moving XMM0 to XMM0).
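
For illustration, roughly what that means under the Microsoft x64 convention:

// 'a' arrives in XMM0 and the float return value is also expected in XMM0,
// so this can compile down to a bare ret.
float foo(float a) { return a; }

// An int arrives in ECX and returns in EAX instead, so even this trivial
// pass-through needs a mov eax, ecx.
int bar(int a) { return a; }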

* I guess link-time code generation changes this, as does calling functions in the same translation unit, but it seems it would add a whole mess of complexity that only applies to some cases, so it probably isn't done unless there is a compelling performance reason.

Juliean said:
I'll have to keep the issues with UB reinterprets in mind, but other than that I think I'll just merge all the float/int opcodes for now.

Isn't the memory-to-memory case the only one that is fully safe to merge, though?

If I recall correctly, the comparison instructions are different because of NaN values, and I think because of some other considerations like signed zero. And of course all the arithmetic operations are different as well.
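
A quick illustration of both cases, assuming 32-bit float and unsigned:

#include <cmath>
#include <cstdio>
#include <cstring>

// Comparing the raw bits disagrees with comparing the floats, both ways.
int main()
{
	float pz = 0.0f, nz = -0.0f;
	unsigned pzBits, nzBits;
	std::memcpy(&pzBits, &pz, sizeof(pz));
	std::memcpy(&nzBits, &nz, sizeof(nz));
	std::printf("%d %d\n", pz == nz, pzBits == nzBits); // "1 0": equal floats, different bits

	float nan = std::nanf("");
	unsigned nanBits;
	std::memcpy(&nanBits, &nan, sizeof(nan));
	std::printf("%d %d\n", nan == nan, nanBits == nanBits); // "0 1": unequal floats, same bits
}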

SSE and other vector extensions do actually combine some things (not sure why exactly they kept scalar integers in the general-purpose registers but transitioned scalar floats to SSE - maybe they really wanted to avoid the 80-bit x87 stuff?), but there are still a lot of instructions that are integer- or fp-specific (for the same data size, e.g. packed 32-bit float and int).

@SyncViews Thanks also for the insights!

SyncViews said:
* I guess link-time code generation changes this, as does calling functions in the same translation unit, but it seems it would add a whole mess of complexity that only applies to some cases, so it probably isn't done unless there is a compelling performance reason.

I didn't really look at the link-time output, as I'm mostly using godbolt.org for this kind of stuff, and I don't think they have a link-time optimizer. But if they did, I'm fairly certain it might end up producing the same functions - I've seen that kind of thing, especially where template-functions with different types all end up being merged back into one block of ASM.

SyncViews said:
Isn't the memory-to-memory case the only one that is fully safe to merge, though?

Memory-to-memory is safe, but if you look at the current OpCodes I posted, then:

m_stack.Pop<int>();

results in a reinterpret_cast<int*>() on the stack's memory, which I'm pretty sure I agree is actually UB (you can cast anything to char* or void*, but not the other way around). In practice, as I said, I'm only running compiler(s) that don't have a problem with this kind of stuff. But I know UB can be nasty. The most annoying issue I ever had was with pretty much the following code:

void dontAskMeWhy(Class* pObject)
{
	bool isNull = !pObject;
	auto& local = *pObject; // UB if pObject is null: binding the reference dereferences it

	if (!isNull) // the compiler assumes the UB above can't happen, infers pObject != nullptr, and drops this check
		local.Function();
}

Without going into the details of the code in question, I was assuming that this was safe. But as dereferencing a nullptr is UB, Clang just decided that it doesn't have to do the if-check at all. So compilers can and absolutely will take advantage of UB to pretty much decide your code is not valid. With reinterpret_casts, I think I've already read about cases where the compiler will discard an entire block of code because it knew that the initial cast is not valid (which I'm afraid would probably happen to my code here if I were on a compiler that gave a fuck :D )

Juliean said:
I didn't really look at the link-time output, as I'm mostly using godbolt.org for this kind of stuff, and I don't think they have a link-time optimizer. But if they did, I'm fairly certain it might end up producing the same functions - I've seen that kind of thing, especially where template-functions with different types all end up being merged back into one block of ASM.

Well, godbolt uses GCC, which does, but if you are only using one source file it doesn't matter. Link-time code generation is just a way to optimise across multiple source files and even static libs, since the linker sees all the files but the compiler only sees the one source file when making the obj. It was just an aside that, given a float foo(float a);, a modern compiler actually might in some cases know that it is OK to put a in an integer register or something, but in the general case it should use the floating-point specific conventions.

C++ templates of course complicate this a bit, but a linker merging identical functions is a lot simpler and doesn't need link-time code generation (it could just compare the final compiled functions in the object files).

Juliean said:
Memory-to-memory is safe, but if you look at the current OpCodes I posted, then: m_stack.Pop<int>(); results in a reinterpret_cast<int*>() on the stack's memory, which I'm pretty sure I agree is actually UB (you can cast anything to char* or void*, but not the other way around). In practice, as I said, I'm only running compiler(s) that don't have a problem with this kind of stuff. But I know UB can be nasty. The most annoying issue I ever had was with pretty much the following code:

Yeah, but I meant that only this memory-to-memory case is safe (and would be fully defined if you changed your implementation*), so you save only a few opcodes at most. You should still have type-specific opcodes for all the comparisons, all the arithmetic, etc.

Since your store/load only copies values, you should definitely be able to make it safe. You are basically reimplementing memcpy(stack_base + stack_size, stack_base + offset, 4), and while memcpy is usually special in the compiler for optimisation reasons, I don't believe the spec requires that, and I believe a pure-C implementation is possible.
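
i.e. something like this for the 32-bit load - a sketch, with stack_base/stack_size/offset as placeholders for your stack layout:

#include <cstddef>
#include <cstring>

// A type-agnostic 32-bit local load done as a raw byte copy - no aliasing
// issue, since everything is accessed as bytes.
void load_word(unsigned char* stack_base, std::size_t stack_size, std::size_t offset)
{
	std::memcpy(stack_base + stack_size, stack_base + offset, 4);
}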

If there is a place where you are completely breaking the rules, I'd guess it is in the other ops, e.g. if you did say:

auto result = (*reinterpret_cast<float*>(stack_ptr + offset_a)) * (*reinterpret_cast<float*>(stack_ptr + offset_b));

And I think even then it is only a problem if the compiler can prove that you previously accessed those as something other than a float.

But again, copying to a local float first should be safe I believe, and might even get optimised out (into just a single load instruction for the register representing the local variable).
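
i.e. the same multiply as above, with the casts replaced by local copies - a sketch, reusing the placeholder names from the earlier snippet:

#include <cstddef>
#include <cstring>

float mul_floats(const unsigned char* stack_ptr, std::size_t offset_a, std::size_t offset_b)
{
	// Copy the raw bytes into locals whose lifetimes have started, then
	// operate on the proper type.
	float a, b;
	std::memcpy(&a, stack_ptr + offset_a, sizeof(a));
	std::memcpy(&b, stack_ptr + offset_b, sizeof(b));
	return a * b; // typically two movss loads and one mulss in release builds
}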

SyncViews said:
Well, godbolt uses GCC, which does, but if you are only using one source file it doesn't matter. Link-time code generation is just a way to optimise across multiple source files and even static libs, since the linker sees all the files but the compiler only sees the one source file when making the obj. It was just an aside that, given a float foo(float a);, a modern compiler actually might in some cases know that it is OK to put a in an integer register or something, but in the general case it should use the floating-point specific conventions.

It probably just needs to be switched on; I only really know the bare-bones switches for optimization levels to get me by.

SyncViews said:
C++ templates of course complicate this a bit, but a linker merging identical functions is a lot simpler and doesn't need link-time code generation (it could just compare the final compiled functions in the object files).

I always just assumed they were the same. It's true that COMDAT folding (/OPT:ICF) is an optimization with a separate setting even in MSVC.

SyncViews said:
Yeah, but I meant that only this memory-to-memory case is safe (and would be fully defined if you changed your implementation*), so you save only a few opcodes at most. You should still have type-specific opcodes for all the comparisons, all the arithmetic, etc.

Puh, I'm not an expert on the C++ standard, but from the wording I've read (can't find it quickly right now) I always assumed that even this was illegal:

char* pMemory; // assume this points at some valid, aligned storage
const int value = *reinterpret_cast<int*>(pMemory);

No matter what I actually end up doing with the value afterwards (and you are right that actual operations on the data, other than copying it, would happen with the right type).

SyncViews said:
But again, copying to a local float first should be safe I believe, and might even get optimised out (into just a single load instruction for the register representing the local variable).

Yes, it will definitely be optimized out of release builds. Unfortunately, I'm in a situation where debug-build performance really matters - I wouldn't have needed the whole rewrite so badly if it wasn't for debug performance (in release even the old system was fast enough for most intents and purposes). Now I know this is a delicate line, and I could also just always set the interpreter.cpp to compile as "release". But for actual debug, I did measure a huge impact from at least std::bit_cast (2-3x as slow), so that's why I went back to reinterpret_casts. That's BTW also why the code I posted is not a template-method but just the same C&P code for int/float - I usually use template-functions heavily, and I don't normally have a problem with small inline functions, but I really don't want to impose the overhead of one additional function-call for all opcodes in debug.

SyncViews said:
Since your store/load only copies values, you should definitely be able to make it safe. You are basically reimplementing memcpy(stack_base + stack_size, stack_base + offset, 4), and while memcpy is usually special in the compiler for optimisation reasons, I don't believe the spec requires that, and I believe a pure-C implementation is possible.

memcpy is special for more reasons than that:

Objects of implicit-lifetime types can also be implicitly created by:

a call to one of the following object representation copying functions, in which case such objects are created in the destination region of storage or the result: std::memcpy, std::memmove, std::bit_cast
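
So if I read that correctly, something like this is fine in C++20, without ever explicitly constructing an int in the buffer (a sketch; g_buffer and roundtrip are made-up names):

#include <cstring>

// memcpy implicitly creates an int (an implicit-lifetime type) in the
// destination storage, and its return value points at that object.
alignas(int) unsigned char g_buffer[sizeof(int)];

int roundtrip(int value)
{
	int* p = static_cast<int*>(std::memcpy(g_buffer, &value, sizeof(value)));
	return *p; // reads the implicitly-created int
}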

