🎉 Celebrating 25 Years of GameDev.net! 🎉

Not many can claim 25 years on the Internet! Join us in celebrating this milestone. Learn more about our history, and thank you for being a part of our community!


Double to float C++

General and Gameplay Programming Programming

Started by taby May 13, 2024 04:27 PM

181 comments, last by JoeJ 1 week ago

JoeJ

4,258

May 23, 2024 06:07 AM

taby said:
Yes, Mach's principle is what you're looking for.

Looking that up, i see they are still trying to confirm it with experiments.
Little do we know…

taby said:
Edit: I'm giving up.

Well, maybe god will tell us more about the universe after we die.
At least the few things he knows about it. ; )

taby said:
I'm sure that you'll figure it out, after more thought.

Sadly the solution is to be more conservative about the upper bound on how far a local change can spread through my geometry.
If i move an object, the initial geometry only changes locally. But after that i calculate a smooth crossfield, wavefield, remesh, and hierarchy, and materials are still ahead of me. With each processing pass i do, the change can affect a larger distance, because each pass considers hierarchical adjacency.
The result is that i need to update a large area even for small changes.

Maybe it does not matter that my progress is so slow. To make this practical for production, very powerful HW is needed to keep waiting times acceptable. I depend on the progress of the HW guys i criticize so much. ; )

taby

1,508

Author

May 23, 2024 03:06 PM

Good day JoeJ. Thank you again for all of your help.

I'm sad because we can't get it to work. That said, just because we can't do it manually doesn't detract from the findings. That is, where alpha = beta = 1, the negative precession that comes with Euler integration goes to zero as dt goes to 0. On top of that, there is relativistic, positive, precession that comes with alpha ≠ beta ≠ 1. It's no coincidence – if you quantize it further, beyond some threshold, then the solution pops into existence. Just because we don't know how to emulate static_cast<double>(static_cast<float>(d)), doesn't mean that it doesn't work.

taby

1,508

Author

May 23, 2024 03:30 PM

P.S. I’ve been procrastinating, when it comes to painting. I’ve been stuck on this snap-to-float problem for weeks now. lol

JoeJ

4,258

May 23, 2024 04:19 PM

taby said:
Just because we don't know how to emulate static_cast<double>(static_cast<float>(d)), doesn't mean that it doesn't work.

But it does work?
I thought you gave up because you think reducing precision can't be right, or for other reasons.
I mean, if the casting stuff works for you but nothing else does, then you could just keep the casting? It's the same as zeroing out the rightmost bits anyway.

Here's my result. The only difference is the off-by-one error on the last bit due to missing rounding:

If this missing bit is a problem, just reduce the shift to 28. Then it should match the casting exactly for the number range close to one.
My printing is limited to 32 bits, so i can't compare as nicely as with your code. But just try again.

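static int shift = 52-23; // double has 52 mantissa bits, float has 23
ImGui::DragInt("shift", &shift, 0.1f, 0, 63);
for (double d = .0; d < 1.1; d+=0.02)
{
	// reference: round trip through float
	double ref = static_cast<double>(static_cast<float>(d));

	// emulation: zero out the lowest 'shift' mantissa bits (truncates, no rounding)
	uint64_t bits = (uint64_t&)d;
	bits = bits & (uint64_t(-1ull)<<shift);
	double emu = (double&)bits;

	// print the value and the upper bits of each result for comparison
	ImGui::Text("ref %f %x | emu %f %x", ref, int(((uint64_t&)ref)>>shift), emu, int(((uint64_t&)emu)>>shift));
}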
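
The off-by-one difference mentioned above comes from the cast rounding to nearest while the mask simply truncates. A minimal sketch of the same bit trick with rounding added could look like the following. The names snap_to_float_bits and check_against_cast are hypothetical (not from the thread), and the sketch assumes IEEE 754 doubles, the default round-to-nearest-even mode, and a finite input inside float's normal range.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the thread): round a double to the nearest
// value whose mantissa fits in 23 bits, ties to even, using integer math only.
// Assumes IEEE 754 doubles and a finite input inside float's normal range.
double snap_to_float_bits(double d)
{
	const int shift = 52 - 23;                                   // bits to drop
	const std::uint64_t half = std::uint64_t(1) << (shift - 1);  // half of the last kept bit

	std::uint64_t bits;
	std::memcpy(&bits, &d, sizeof bits);   // well-defined alternative to the (uint64_t&) cast

	const std::uint64_t lsb = (bits >> shift) & 1u;  // last kept bit decides ties
	bits += half - 1 + lsb;                          // a carry here rounds up, possibly into the exponent
	bits &= ~std::uint64_t(0) << shift;              // then drop the low bits

	double out;
	std::memcpy(&out, &bits, sizeof out);
	return out;
}

// Self-check over the same range as the loop above.
void check_against_cast()
{
	for (double d = 0.0; d < 1.1; d += 0.02)
		assert(snap_to_float_bits(d) == static_cast<double>(static_cast<float>(d)));
}
```

Calling check_against_cast() should pass silently if the assumptions hold. A value such as 0.1 is one where plain truncation lands one float step below the cast, because the dropped bits are more than half of the last kept bit.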

taby

1,508

Author

May 23, 2024 04:32 PM

Sorry about the confusion… it’s on my end.

Casting to and from float works perfectly, for either integrator. The only big difference is that the symplectic 4th-order integrator is roughly one order of magnitude faster than the Euler integrator.

Thanks for more help! I will check it out ASAP. You’re the best, man.

taby

1,508

Author

May 23, 2024 04:35 PM

P.S. If we can snap to float manually, then that would be perfect. This way we can know by how much we are over-quantizing. I’m like… what if the scale is like 10^-11? LOL
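
To get a feel for the scale of the snapping, the spacing between neighbouring floats at a given value can be measured directly. This is only a sketch; float_spacing_at and check_spacing are hypothetical names, and IEEE 754 floats are assumed.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: distance from static_cast<float>(d) to the next float above it.
// With round-to-nearest, snapping a double to float moves it by at most half of
// the float spacing around it.
double float_spacing_at(double d)
{
	const float f = static_cast<float>(d);
	const float up = std::nextafter(f, std::numeric_limits<float>::infinity());
	return static_cast<double>(up) - static_cast<double>(f);
}

void check_spacing()
{
	// In [0.5, 1) the float spacing is 2^-24 (about 5.96e-8); from 1.0 upward it is 2^-23.
	assert(float_spacing_at(0.75) == std::ldexp(1.0, -24));
	assert(float_spacing_at(1.0) == std::ldexp(1.0, -23));
}
```

So for values just below one, the snap works at a scale of roughly 10^-8, and the spacing halves each time the value drops below the next power of two.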

taby

1,508

Author

May 23, 2024 05:03 PM

I'm sorry to say that it doesn't work.

#include <cstdint>
#include <iomanip>
#include <iostream>
using namespace std;

double truncate_normalized_double(double d)
{
	if (d <= 0.0)
		return 0.0;
	else if (d >= 1.0)
		return 1.0;

	//////return static_cast<double>(static_cast<float>(d));

	uint64_t shift = 52 - 23;
	uint64_t max = -1;

	uint64_t bits = (uint64_t&)d;
	bits = bits & (uint64_t(max << shift));
	double emu = (double&)bits;

	return emu;
}

int main(void)
{
	cout << setprecision(30) << endl;

	for (double d = 0; d < 1.0; d += 0.1)
		cout << truncate_normalized_double(d) << endl;

	return 0;
}

JoeJ

4,258

May 23, 2024 06:28 PM

taby said:
uint64_t max = -1;

Your port does not work for me either, probably because of this line.
Any integer number is treated as 32 bit, so it will be converted to 0x00000000FFFFFFFF instead of 0xFFFFFFFFFFFFFFFF. Which masks away all the precious significant bits.

It should work if you write -1ull, but if not - copy my code precisely.

Signs cause confusion with bit math, which with 64 bit types is an annoying problem due to legacy C / C++ convention. It's also a grey zone since a right shift of negative integers was implementation-defined before C++20 afaik, thus always use unsigned types.
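
The mask construction can be checked in isolation. As far as the language rules go, converting -1 to uint64_t wraps modulo 2^64 and gives all ones; the case that does lose the upper bits is a shift carried out in a 32-bit type before the result is widened. A minimal sketch, with hypothetical function names and assuming a 32-bit unsigned int:

```cpp
#include <cassert>
#include <cstdint>

// Mask built the way the port above does it.
std::uint64_t mask_from_minus_one(unsigned shift)
{
	std::uint64_t max = -1;   // int -1 converts modulo 2^64, so all 64 bits are set
	return max << shift;
}

// Pitfall: the shift is evaluated in unsigned int before the result is widened.
// Assumes a 32-bit unsigned int, as on common desktop compilers.
std::uint64_t mask_shifted_too_early(unsigned shift)
{
	return 0xFFFFFFFFu << shift;
}

void check_masks()
{
	assert(mask_from_minus_one(0) == UINT64_MAX);
	assert(mask_from_minus_one(29) == 0xFFFFFFFFE0000000ull);
	assert(mask_from_minus_one(29) == (~0ull << 29));
	assert(mask_shifted_too_early(29) == 0x00000000E0000000ull);
}
```

If check_masks() passes, the mask itself is not the culprit, and the remaining difference to the cast is the truncation versus rounding of the last kept bit.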

taby

1,508

Author

May 23, 2024 06:38 PM

This works:

double truncate_normalized_double(double d)
{
	if (d <= 0.0)
		return 0.0;
	else if (d >= 1.0)
		return 1.0;

	//////return static_cast<double>(static_cast<float>(d));

	float f = static_cast<float>(d);

	float tempf = nexttowardf(1.0f, f);

	while (tempf > f)
		tempf = nexttowardf(tempf, f);

	return static_cast<double>(tempf);
}

taby

1,508

Author

May 23, 2024 06:40 PM

This doesn't work. I checked the value of max, and it is 2^64 - 1, as expected.


double truncate_normalized_double(double d)
{
	if (d <= 0.0)
		return 0.0;
	else if (d >= 1.0)
		return 1.0;

	//////return static_cast<double>(static_cast<float>(d));

	uint64_t shift = static_cast<uint64_t>(52) - static_cast<uint64_t>(23);
	uint64_t max = static_cast<uint64_t>(-1);

	uint64_t bits = (uint64_t&)d;
	bits = bits & (uint64_t(max << shift));
	double emu = (double&)bits;

	return emu;
}
