Argh, bitrot.

Published September 19, 2014
Turns out that leaving a project alone for six months is a great way to discover that it's full of mysterious bugs you don't remember having.

I noticed some weird behavior with Era, the Epoch IDE, earlier this evening, and started poking around looking for explanations. It turns out that in some weird combination of circumstances, the Epoch program (Era in this case) can get partway through executing a Windows message handler, crash internally, and then continue on merrily as if nothing had ever happened (except without executing the remainder of the WndProc or whatever else the proc had called out to).

My best guess is that there is some Structured Exception Handling magic going on in the LLVM library that causes the crash to partially unwind the call stack and then continue executing from some weird spot. I don't get reliable enough callstacks to prove this just yet, because the JITted Epoch code is pretty hard to untangle in a debugger.

So the goal of the day is to find out what the actual crash is, hopefully by capturing it under a debugger. But this apparently happens a lot more frequently than I'd realized, because attaching a debugger to the running Era process turns up all kinds of crashes and weird behavior. Something in the Epoch runtime is seriously borked.

At this point, it's 2332 hours (yayy palindromic times) and I'm not liable to get much sleep tonight. This is going to bug the shit out of me. So I grab a fresh beverage and settle in for a heavy debug session.


Initially, I turn on all of the CRT memory debugging features I know of, and fire up Era. Well, I try. Turns out that _CRTDBG_CHECK_ALWAYS_DF really does murder performance: Era, which normally starts up and has my code visible in under a second, has been trying to load for the better part of 15 minutes now.

LLVM's optimizer apparently allocates memory like a freaking madman. Obviously not written by game developers.

Meanwhile, Era has pegged a core of my laptop's already-warm CPU and is showing no signs of being ready any time soon. Maybe that beverage needs a little ethanol in it...


At 2352, the IDE still shows no signs of finishing the loading process. The LLVM optimizer is hard at work burning through trillions of allocations and also murdering my poor CPU. If it were anywhere near this painful in Release, I'd seriously consider offering to write them some better allocation strategies so I can stop wasting my youth waiting for the dumb thing to finish calling operator new().

Periodic checking of the progress in the debugger indicates that, yes, we are making progress - it seems that the optimizer is finally working through the last few passes. This is at 0002 hours, so basically 30 minutes have elapsed just waiting for the IDE to load.

I might not whine about Visual Studio's startup time for a little while.


Nah, I'll still whine.


Anyways... of course once optimizations are done, we still have to generate machine code. Turns out this is even worse in terms of millions of tiny allocations. Quick back-of-the-cocktail-napkin estimates show the IDE finishing its load sometime around 8 AM. Screw this.


Sure enough, a couple of minutes of poking with the per-allocation checking turned off yields pay dirt. Looks like I had an off-by-one error in the library routine for converting strings between UTF-8 and UTF-16. DERP.
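For illustration only (this is not the actual Epoch library routine, and the post doesn't record exactly where the off-by-one lived): a common home for this kind of bug in UTF-8 to UTF-16 conversion is the output buffer size, where the slot for the terminator gets forgotten. A minimal sketch, assuming well-formed UTF-8 input and using made-up names (decode_utf8, utf8_to_utf16_z):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decode one code point starting at s[i] and advance i past it.
// Assumption: the input is well-formed UTF-8; no validation is done.
static char32_t decode_utf8(const std::string& s, std::size_t& i) {
    const unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;                        // plain ASCII
    const int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
    char32_t cp = b0 & (0x3F >> extra);              // payload bits of the lead byte
    for (int k = 0; k < extra && i < s.size(); ++k) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Convert to UTF-16 code units, including a trailing terminator.
std::vector<char16_t> utf8_to_utf16_z(const std::string& utf8) {
    // Pass 1: count code units. Code points above the BMP need a surrogate pair.
    std::size_t units = 0;
    for (std::size_t i = 0; i < utf8.size();) {
        units += (decode_utf8(utf8, i) >= 0x10000) ? 2 : 1;
    }

    // The classic off-by-one: allocating `units` here instead of `units + 1`
    // leaves no room for the terminator written below.
    std::vector<char16_t> out(units + 1);

    // Pass 2: write the code units.
    std::size_t w = 0;
    for (std::size_t i = 0; i < utf8.size();) {
        char32_t cp = decode_utf8(utf8, i);
        if (cp >= 0x10000) {
            cp -= 0x10000;
            out[w++] = static_cast<char16_t>(0xD800 + (cp >> 10));
            out[w++] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out[w++] = static_cast<char16_t>(cp);
        }
    }
    assert(w == units);  // both passes must agree on the length
    out[w] = u'\0';
    return out;
}
```

Counting first and writing second keeps the size computation in one place, which makes a missing "+ 1" much easier to spot than when the arithmetic is scattered across call sites.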

Fixing that leads to more interesting crashes, this time somewhere in the garbage collector. It's verging on 0045 and I'm wondering how much more I've got in the tank... but this is too compelling to pass up.

The faulting code looks innocent enough: loop through a giant std::map of allocated string handles, and prune out all the ones that don't have any outstanding references. For some reason, though, std::wstring is barfing deep in a destructor call, apparently because the "this" pointer is something wonky.
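A minimal sketch of that kind of pruning loop, to show why it looks innocent. The types here (StringHandle, StringEntry, a plain refcount field) are hypothetical stand-ins, not the real Epoch runtime structures:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-ins for the runtime's string table.
using StringHandle = std::uint32_t;

struct StringEntry {
    std::wstring text;
    unsigned refcount;
};

// Remove every entry with no outstanding references; returns how many were pruned.
std::size_t prune_unreferenced(std::map<StringHandle, StringEntry>& table) {
    const std::size_t before = table.size();
    std::size_t removed = 0;
    for (auto it = table.begin(); it != table.end();) {
        if (it->second.refcount == 0) {
            // Since C++11, map::erase returns the iterator following the erased
            // element, so the loop never touches an invalidated iterator.
            it = table.erase(it);
            ++removed;
        } else {
            ++it;
        }
    }
    assert(table.size() + removed == before);
    return removed;
}
```

If the loop has this shape, the iteration itself is sound, so a std::wstring destructor blowing up inside erase suggests the entry was already damaged before the loop reached it.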

My first guess, of course, is that I have mismatched code - something compiled in one way while linking to (or sharing objects with) something from another compilation setup. Time to go spelunking in the Visual Studio project settings...


Sadly, probing into the compiler/linker settings yields no obvious discrepancies. Time for the good ol' magical Clean Rebuild.

No joy. Next attempt is to disable the deletion of garbage strings... it'll murder my memory footprint, but it might also reveal what else is interfering with the string table. This causes the crashes to stop for the most part, even with Application Verifier enabled, which is pesky. I do, however, get a crash when exiting the program, i.e. when the strings are being destroyed not by garbage collection but by the actual teardown process.

This indicates a memory stomp to me... which is slightly terrifying. Something, somewhere, seems to be clobbering entries on the string table. I haven't yet discerned a pattern to the data that gets written on top of the table entries, so it isn't entirely clear what's doing the stomping.

It's 0111 and I'm seriously tired. My best guess is that the stomp originates from the string marshaling code that interacts with external C APIs, specifically in this case the Win32 API. I suspect that I'm doing some evil cast someplace that confuses pointer types and causes chaos, but I'm far too hazy to pinpoint that as the cause for certain tonight.
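If that suspicion is right, the usual shape of the problem is handing a C API a pointer obtained by casting away const on a string's internal storage, and then letting the API write through it. The sketch below is an assumption-laden illustration, not the Epoch marshaling code: fake_get_text is a made-up stand-in for a Win32-style call that fills a caller-supplied buffer, and get_text_safely is a hypothetical wrapper that gives it scratch memory the wrapper actually owns.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>
#include <vector>

// Made-up stand-in for a C API that writes text into the caller's buffer.
// `capacity` counts wchar_t slots, including the terminator.
// Returns the number of characters written, excluding the terminator.
std::size_t fake_get_text(wchar_t* buffer, std::size_t capacity) {
    static const wchar_t kText[] = L"hello from C";
    const std::size_t len = sizeof(kText) / sizeof(kText[0]) - 1;
    if (capacity == 0) return 0;
    const std::size_t n = (len < capacity - 1) ? len : capacity - 1;
    std::wmemcpy(buffer, kText, n);
    buffer[n] = L'\0';
    return n;
}

// The dangerous pattern, shown only as a comment:
//
//   const std::wstring& shared = lookup_in_string_table(handle);  // hypothetical
//   fake_get_text(const_cast<wchar_t*>(shared.c_str()), 256);
//
// That writes up to 256 characters through a pointer to storage the string
// never promised to make writable or that large: undefined behavior, and a
// plausible way to stomp on whatever lives next to it.

// Safer: let the API write into scratch memory we own, then copy the result out.
std::wstring get_text_safely(std::size_t max_chars) {
    std::vector<wchar_t> scratch(max_chars + 1, L'\0');  // +1 for the terminator
    const std::size_t written = fake_get_text(scratch.data(), scratch.size());
    assert(written < scratch.size());
    return std::wstring(scratch.data(), written);
}
```

The extra copy costs a little, but the only memory the C function can touch is a buffer whose size was passed to it honestly.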


0116 - time to cash in. We'll see how this goes next time!

Comments

duckflock

I noticed some weird behavior with Era, the Epoch IDE, earlier this evening, and started poking around looking for explanations. It turns out that in some weird combination of circumstances, the Epoch program (Era in this case) can get partway through executing a Windows message handler, crash internally, and then continue on merrily as if nothing had ever happened (except without executing the remainder of the WndProc or whatever else the proc had called out to).

This is the default behavior for SEH on x64 Windows. It happens when an exception occurs after passing through the kernel in a user -> kernel -> user transition (most prominently, message handlers). If something in a message handler throws an unhandled exception, the stack is unwound only to the kernel boundary and the kernel just treats the message as handled, most likely leaving the application state corrupted.

See for instance http://blog.paulbetts.org/index.php/2010/07/20/the-case-of-the-disappearing-onload-exception-user-mode-callback-exceptions-in-x64/

A simple way to force a stop is to have the VS debugger break when exceptions are thrown.

September 19, 2014 08:59 AM
ApochPiQ
Awesome, thanks for that link!
September 19, 2014 07:41 PM
duckflock

Glad I could be of help. Always enjoying your journal entries :)

September 20, 2014 08:35 AM
nukomod

It's twisted enjoying another man's debugging pain but I admit I did! I guess I love a good mystery, but it's even better when it gets solved. Please let us know how the next debugging session goes!

September 21, 2014 03:58 PM
ApochPiQ
Turns out that pointer casts are dangerous. Who knew!?


Remember, kiddies, const should not be discarded unless you REALLY mean it. And you probably don't.
September 22, 2014 02:41 AM
Washu
const_cast best cast.
September 22, 2014 03:11 AM