Double to float C++

General and Gameplay Programming

Started by taby May 13, 2024 04:27 PM

166 comments, last by taby 10 hours, 30 minutes ago

taby

1,507

Author

May 13, 2024 04:27 PM

What arcane magic occurs when a double is cast as a float?

Juliean

7,350

May 13, 2024 08:18 PM

Cannot be answered generally. The standard does not mandate any specific implementation for either float or double. From the relevant cppreference page (https://en.cppreference.com/w/cpp/language/types), it can be inferred that they are “usually IEEE-754”. So we can look at implementations that use those, like Windows. This article on MSDN describes the process:

https://learn.microsoft.com/en-us/cpp/c-language/conversions-from-floating-point-types?view=msvc-170

When the compiler converts a double or long double floating-point number to a float, it rounds the result according to the floating-point environment controls, which default to "round to nearest, ties to even." If a numeric value is too high or too low to be represented as a numeric float value, the conversion result is positive or negative infinity according to the sign of the original value, and an overflow exception is raised, if enabled.

On x64, this will in practice most likely mean that a “cvtsd2ss” instruction is used, which again is documented (https://www.felixcloutier.com/x86/cvtsd2ss).
So nothing really arcane or magic is going on, any more than when converting a float to an int.
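
To make the rounding described above concrete, here is a minimal sketch. It assumes float and double are IEEE-754 binary32 and binary64, which the static_assert checks. The helper names `round_trip_through_float` and `fits_in_float` are made up for illustration and do not come from any library.

```cpp
#include <cmath>
#include <limits>

// Assumption: IEEE-754 float and double. The C++ standard does not require
// this, so refuse to compile on platforms where it does not hold.
static_assert(std::numeric_limits<float>::is_iec559 &&
              std::numeric_limits<double>::is_iec559,
              "this sketch assumes IEEE-754 floating point");

// Hypothetical helper: narrow to float (rounding to a nearby float, nearest
// in the default rounding mode), then widen back to double. Widening is
// exact, so the result is simply that float value stored in a double.
inline double round_trip_through_float(double x)
{
    return static_cast<double>(static_cast<float>(x));
}

// Hypothetical helper: true if |x| is no larger than the largest finite
// float. Magnitudes beyond that are where the overflow case described in the
// MSDN text can occur. NaN yields false.
inline bool fits_in_float(double x)
{
    return std::fabs(x) <= static_cast<double>(std::numeric_limits<float>::max());
}
```

For example, `round_trip_through_float(0.5)` returns 0.5 unchanged because 0.5 is exactly representable as a float, whereas `round_trip_through_float(0.1)` returns a value that differs from the double 0.1 by roughly 1.5e-9.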

taby

1,507

Author

May 14, 2024 01:17 AM

OK, thank you for all of the insight.

Here's the snippet of code that I'm having trouble with. Basically, I need to cast as float. If I do not cast to float, the calculation returns the wrong value. If I cast as float, it works perfectly.

Here is the code:

void proceed_Euler(custom_math::vector_3& pos, custom_math::vector_3& vel, const long double G, const long double dt)
{
	const custom_math::vector_3 grav_dir = sun_pos - pos;
	const double distance = grav_dir.length();
	const double Rs = 2 * grav_constant * sun_mass / (speed_of_light * speed_of_light);

	const double alpha = 2 - sqrt(1 - (vel.length() * vel.length()) / (speed_of_light * speed_of_light));

	double beta = sqrt(1 - Rs / distance);
	beta = static_cast<float>(beta);

	custom_math::vector_3 accel = grav_acceleration(pos, vel, G);

	vel += accel * dt * alpha;
	pos += vel * dt * beta;
}

Instead of casting as float, I also tried to set the number of decimal places by using this function:

// Stolen from Stack Exchange
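// Requires #include <cmath>. Rounds f to the given number of decimal places.
double precision(double f, int places)
{
	const long double n = std::pow(10.0, places); // scale factor 10^places
	return static_cast<double>(std::round(f * n) / n);
}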

No joy though. Only the casting as float works.

Any ideas?
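
One possible reason the two approaches differ: precision() snaps a value to a multiple of 10^-places, while a cast to float snaps it to the nearest value with a 24-bit significand, and those two grids rarely coincide. The sketch below assumes IEEE-754 floats and the default rounding mode. `round_to_decimals` mirrors the precision() function above, and `round_to_bits` is a hypothetical helper that performs the binary rounding explicitly so the two can be compared.

```cpp
#include <cmath>

// Decimal rounding, as in the precision() function above: the result is a
// multiple of 10^-places (up to double rounding error).
inline double round_to_decimals(double f, int places)
{
    const double n = std::pow(10.0, places);
    return std::round(f * n) / n;
}

// Hypothetical helper: round x to `bits` significant binary digits.
// With bits = 24 this should match static_cast<float>(x) for values in
// float's normal range, assuming IEEE-754 and round-to-nearest-even.
inline double round_to_bits(double x, int bits)
{
    if (x == 0.0 || !std::isfinite(x)) return x;
    int exponent = 0;
    // x = mantissa * 2^exponent with 0.5 <= |mantissa| < 1
    const double mantissa = std::frexp(x, &exponent);
    // Move `bits` binary digits to the left of the binary point.
    const double scaled = std::ldexp(mantissa, bits);
    // Round to an integer in the current rounding mode, then scale back.
    return std::ldexp(std::nearbyint(scaled), exponent - bits);
}
```

Taking 0.99999996 as a sample input, `round_to_decimals` gives 1.0 with 7 places and leaves the value at 0.99999996 with 8 places, whereas `round_to_bits(0.99999996, 24)` and the float cast both give 1 - 2^-24, about 0.99999994. No choice of decimal places lands on that binary value.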

taby

1,507

Author

May 14, 2024 02:15 AM

P.S. I've also tried this on Ubuntu WSL.

JoeJ

4,250

May 14, 2024 05:59 AM

taby said:
I need to cast as float. If I do not cast to float, the calculation returns the wrong value. If I cast as float, it works perfectly.

You say that this line:

beta = static_cast<float>(beta);

is needed to make the code work? And it does not work if you comment it out?

I mean, it converts to float, then converts back to double since beta is double. So all it does is reduce precision.
What kind of improvement does this give?

Did you try to log the numbers to a file to compare?

Edit: If distance is smaller than Rs, you take the square root of a negative number. Maybe this causes the problem.
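
The square-root concern can be checked directly. This is a small sketch, assuming IEEE-754 behaviour where std::sqrt of a negative number returns NaN. The names `beta` and `beta_is_valid` are illustrative only.

```cpp
#include <cmath>

// beta as written in the posted code: sqrt(1 - Rs / distance).
inline double beta(double Rs, double distance)
{
    return std::sqrt(1.0 - Rs / distance);
}

// Hypothetical helper: false when the expression above produced NaN, which
// happens when distance < Rs (negative argument to sqrt) on IEEE-754
// platforms.
inline bool beta_is_valid(double Rs, double distance)
{
    return !std::isnan(beta(Rs, distance));
}
```

Calling `beta_is_valid` once per step and logging the first failure would rule this cause in or out quickly.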

taby

1,507

Author

May 14, 2024 03:29 PM

Thanks once again for all of your insight. It's truly helpful.

JoeJ said:
You say that this line:
beta = static_cast<float>(beta);
is needed to make the code work? And it does not work if you comment it out?

If I comment the casting out, I get an answer of 7.75. If the casting remains, I get an answer of 42.66. The analytical solution gives an answer of 42.94.

I mean, it converts to float, then converts back to double since beta is double. So all it does is reduce precision.
What kind of improvement does this give?

This is what I'm having so much trouble with. Basically, it's eerie how the solution comes up with such a correct answer. I tried doing the truncation/rounding using the precision() function, but it's not quite the same. This is what I wonder about.

Did you try to log the numbers to a file to compare?

To the screen using cout, but basically, yes.

Edit: If distance is smaller than Rs, you take the square root of a negative number. Maybe this causes the problem.

Not a problem. We're modelling the orbit of Mercury.

P.S. It's like we're quantizing gravitation. I dare say this half-seriously.
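
The "quantizing" remark can be made concrete for the beta expression in proceed_Euler. Adjacent floats just below 1.0 are 2^-24 (about 6e-8) apart. The sketch below is a hedged illustration, not taken from the linked code: it assumes IEEE-754 floats, a solar Schwarzschild radius of roughly 2953 m, and approximate Mercury perihelion and aphelion distances of 4.6e10 m and 6.98e10 m. The names `assumed_Rs`, `beta_double`, `beta_via_float` and `float_step_below_one` are made up here.

```cpp
#include <cmath>

// Assumed value, not taken from the thread: Schwarzschild radius of the Sun
// in metres, Rs = 2GM/c^2, roughly 2953 m.
constexpr double assumed_Rs = 2953.0;

// beta as computed in proceed_Euler, kept in double precision.
inline double beta_double(double distance)
{
    return std::sqrt(1.0 - assumed_Rs / distance);
}

// The same value after the float round trip from the posted code.
inline double beta_via_float(double distance)
{
    return static_cast<double>(static_cast<float>(beta_double(distance)));
}

// Spacing of floats immediately below 1.0 (2^-24, about 5.96e-8).
inline double float_step_below_one()
{
    return std::ldexp(1.0, -24);
}
```

Under these assumptions, `beta_double` is about 1 - 3.2e-8 at 4.6e10 m and about 1 - 2.1e-8 at 6.98e10 m. After the round trip the first becomes exactly 1 - 2^-24 and the second exactly 1.0, so the cast turns a smoothly varying factor into one that takes only two values over the orbit. Whether that accounts for the results reported in this thread is not something the sketch establishes.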

taby

1,507

Author

May 14, 2024 03:34 PM

The full code is at:

https://github.com/sjhalayka/mercury_gr

JoeJ

4,250

May 14, 2024 03:42 PM

taby said:
If I comment the casting out, I get an answer of 7.75. If the casting remains, I get an answer of 42.66. The analytical solution gives an answer of 42.94.

Can you post example numbers so I can reproduce the issue?

Juliean

7,350

May 14, 2024 03:53 PM

taby said:
If I comment the casting out, I get an answer of 7.75. If the casting remains, I get an answer of 42.66. The analytical solution gives an answer of 42.94.

taby said:
Not a problem. We're modelling the orbit of Mercury.

Are the distances/floating point values at any point in the calculation extremely large? Double→float does not only reduce precision, it also has to deal with values that are outside the range of the smaller type (per the MSDN text quoted above, those become positive or negative infinity). That would be my only remaining guess of what could explain a difference.

Otherwise, you need to get a debugger out, and step through both versions of the code to see a difference. You obviously know what number you expect in a certain case. Record the inputs that you feed into the function, that then produces the different values. Make proceed_EulerDouble and proceed_EulerFloat. Call them with the same parameters, with a debugger attached, in a simplified version of the app. Step through the code and examine the values at each step. This is the only surefire way to diagnose this.
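
Alongside stepping in a debugger, the same comparison can be automated by recording a value at every step in both variants and finding where the traces first part ways. The following is a minimal sketch of that idea. `first_divergence` is a hypothetical helper, and which quantity to record (beta, a position component, and so on) is left to the reader.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: given two traces of the same quantity recorded step by
// step (for example from a double-only run and from a run with the float
// cast), return the index of the first step where they differ by more than
// tol. Returns the shorter trace's length if no such step exists.
inline std::size_t first_divergence(const std::vector<double>& a,
                                    const std::vector<double>& b,
                                    double tol)
{
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i)
    {
        // The negated comparison also flags steps where either trace holds
        // NaN or infinity, since those never satisfy "<= tol".
        if (!(std::fabs(a[i] - b[i]) <= tol))
            return i;
    }
    return n;
}
```

The returned index tells you which step to set a conditional breakpoint on, which saves stepping through thousands of identical iterations by hand.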

taby

1,507

Author

May 14, 2024 04:06 PM

@joej I just put the link to the code in a previous post. It’s set up to use GLUT, but you can comment that stuff out; it’s way faster if you do.
