Bug Bash: 0xdeadbeef

My first job out of grad school was at Garmin, in the Auto OEM division. Our team's mission was to integrate Garmin's navigation and maps software into the infotainment systems made by our OEM partner. This particular project was for Toyota cars in Thailand that used Panasonic hardware.

Our workflow was a complex dance of building code and testing on prototype devices. We would compile our code locally and use a memory stick to upload releases to the devices provided by our OEM partner. There were no real-time logs or breakpoints for debugging - our main tool was a hex debugger to view memory, threads, stack traces, and registers via a JTAG connector. To collect logs, we had to transfer them off the device using a USB stick.

Prototype Hardware

Once a week, we would drop a release via FTP to our partner's servers in Thailand. After each drop, we eagerly awaited feedback and bug reports on Quality Center, sometimes accompanied by stack traces from their testing. Then we'd rinse and repeat, iterating based on their feedback. Our focus shifted frequently - one week working on UI glyphs, the next optimizing nav for Thailand.

One day, during testing, the OEM partner reported an abrupt crash during active navigation that caused the entire system to reboot. There was no video or clear reproduction steps attached - the initial report seemed minor. However, over the next few days more reports came in of the same crash happening inconsistently. When we first examined the stack traces, there was no obvious pattern, as each one failed at a different location. Over the next week, this bug was escalated to management as a priority one issue. Despite the lack of a clear lead from the initial evidence, the intermittent nature and potential customer impact made this very important to investigate.

My teammate Raka and I began investigating the crash. For context, our codebase was a Windows CE port of Garmin's software originally developed for Linux. It used an interface layer to translate Linux function calls to their Windows CE equivalents. The application was multithreaded but ran on a single processor core.
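To give a flavor of that interface layer, here is a hypothetical sketch of one such shim - the application calls a POSIX-style mutex API, and on Windows CE the calls are routed to Win32 critical sections. The names are illustrative, not Garmin's actual code:

```cpp
#include <windows.h>

// Illustrative portability shim: POSIX-style mutex names implemented on
// top of the Win32 critical-section primitives available on Windows CE.
struct osal_mutex {
    CRITICAL_SECTION cs;
};

inline void osal_mutex_init(osal_mutex* m)    { InitializeCriticalSection(&m->cs); }
inline void osal_mutex_lock(osal_mutex* m)    { EnterCriticalSection(&m->cs); }
inline void osal_mutex_unlock(osal_mutex* m)  { LeaveCriticalSection(&m->cs); }
inline void osal_mutex_destroy(osal_mutex* m) { DeleteCriticalSection(&m->cs); }
```

On Linux the same names would simply wrap pthread_mutex_t, so the application code above the layer stays identical on both platforms.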

We pored over the stack traces, looking for clues. We added logging everywhere using std::cout, scrutinizing every area - memory leaks, UI glyphs, map tiles, fonts, the navigation DAG, voice output, GPS fix - no stone was left unturned. Despite our best efforts, the bug remained elusive. It didn't occur in our simulator, and while it would appear occasionally during testing on hardware, we couldn't reliably reproduce it or pin down the root cause. We were stuck chasing ghosts in the code with each new intermittent occurrence, unable to consistently trigger the crash under the debugger. It was frustrating to have such an impactful bug that seemed to defy all attempts at diagnosis.

After another fruitless week of debugging, we decided on a change of scenery. We loaded the prototype devices into a company car and took a day trip on the long, straight roads of Kansas. We hoped the real-world driving conditions might finally trigger the crash. But despite miles of uneventful highway, the bug refused to surface. Needless to say, our road trip debug session was a bust.

Given our lack of progress, management decided to pull in reinforcements despite our team's domain expertise. They called in Senior and Staff Engineers from other teams to review our code. These engineers knew little about the intricacies of the hardware or the overall system architecture we were working with. Armed with only partial information, they began hypothesizing potential causes, often rehashing theories we had already investigated. While well-intentioned, these outside perspectives failed to deliver any major breakthroughs. We appreciated the extra eyes on the problem, but felt frustrated that in this situation the senior engineers were more of a distraction, sending us off to chase their speculative guesses.

The next day, Raka and I decided to reexamine the original stack traces more closely, searching for patterns. This time we noticed something - the crashes often occurred just after a memcpy operation. This memcpy was part of a small 16-byte cache optimized for map loading, originally written by our former lead engineer before he left the company.

When we reported this finding as an update, management started questioning the quality of that engineer's code. However, we defended his work, as the cache implementation seemed solid. Our former lead was known for clean, efficient code, so we were confident the issue lay elsewhere. Though it provided a tempting theory, we refused to believe the crash was due to bugs in his cache design without hard evidence.

As another week dragged on without answers, the pressure mounted. Raka and I, needing a break, took a day off to clear our heads. When we returned, refreshed but determined, we decided to focus intensely on the suspect memcpy area. We surrounded it with extra logging and even inserted sleep calls to slow things down.
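The instrumentation boiled down to something like the sketch below. The names are illustrative, and on the device we used the platform's own sleep call, but the idea was the same: log around the suspect copy and stretch the timing window so any other writer had more chance to stomp on the cache while we were watching.

```cpp
#include <chrono>
#include <cstring>
#include <iostream>
#include <thread>

// Illustrative stand-ins for the real map cache.
constexpr std::size_t kCacheEntrySize = 16;
static unsigned char map_cache[kCacheEntrySize];

// Wrapper around the suspect copy: log the destination pointer, do the
// copy, then sleep briefly before logging again, widening the window in
// which a rogue writer could overwrite the cache between the two logs.
void cache_store(const void* src) {
    std::cout << "cache_store: dst=" << static_cast<const void*>(map_cache) << "\n";
    std::memcpy(map_cache, src, kCacheEntrySize);
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
    std::cout << "cache_store: byte[0] after copy = "
              << static_cast<int>(map_cache[0]) << "\n";
}

int main() {
    unsigned char tile[kCacheEntrySize];
    std::memset(tile, 0xAB, sizeof(tile));  // dummy map-tile bytes
    for (int i = 0; i < 100; ++i) {
        cache_store(tile);
    }
}
```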

Finally, our tactics paid off - we were able to trigger the crash consistently for the first time! It felt like a major breakthrough after endless weeks of dead ends. Though the root cause still eluded us, we could now reliably reproduce the bug under the debugger. This new evidence was progress we desperately needed to restart our stalled investigation.

In the memory traces, we noticed something very strange - a portion of our cache was being overwritten with blanks. “WTF?” we went. We started investigating whether one of our threads was erroneously skipping the global mutex and stomping on the cache. But the JTAG-connected debugger showed all our threads acquiring the mutex properly.

Some phantom code somewhere had the address of our cache and was memsetting it to zeros. This made no sense - the Linux version of our codebase worked flawlessly, and so did the simulator build, with no threading bugs or segfaults. Yet on WinCE, this rogue memory overwrite was clearly occurring. We were utterly perplexed as to where this corruption could be coming from.

Desperate for answers, Raka and I hypothesized the memcpy itself could be the issue. We quickly wrote a minimal C++ program that initialized a block of memory to zeros, then repeatedly copied 16 bytes of the 0xdeadbeef pattern into it in an infinite loop. Within seconds of starting, it crashed! The stack trace looked familiar.
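The exact repro code is long gone, but from memory it looked roughly like the following reconstruction. On a healthy platform the check never fires and the loop just spins; on the prototype hardware the copy itself soon misbehaved:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    // 16-byte destination, zero-initialized like a fresh cache slot.
    std::uint8_t dest[16];
    std::memset(dest, 0, sizeof(dest));

    // 16 bytes of the 0xdeadbeef pattern, copied in on every iteration.
    const std::uint32_t pattern[4] = {0xdeadbeef, 0xdeadbeef,
                                      0xdeadbeef, 0xdeadbeef};

    for (unsigned long iteration = 0;; ++iteration) {
        std::memcpy(dest, pattern, sizeof(dest));

        // If memcpy behaved, every byte of dest now matches the pattern.
        // With the faulty OS memcpy, the copy occasionally landed at the
        // wrong address, so either this check failed or the program
        // simply crashed with a familiar-looking stack trace.
        if (std::memcmp(dest, pattern, sizeof(dest)) != 0) {
            std::printf("corruption detected at iteration %lu\n", iteration);
            return 1;
        }
    }
}
```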

This proved the overwrite bug wasn't in our code at all. By isolating the issue in a barebones repro, we had exonerated our codebase. The root cause was somewhere lower in the stack - something at the OS or hardware level was corrupted. We finally had evidence to back up our theory of an external phantom menace.

Our management was relieved we'd finally made tangible progress. We approached the OEM partner and presented our findings. The next day, they returned with an apology - the root cause was on their end all along. It turned out there was a bug in how they had implemented memcpy and other low-level OS functions. In rare cases, an arithmetic error in computing memory addresses resulted in memset being applied to the wrong address. Though frustrated by the wild goose chase, we were glad to close the book on this crash mystery.

In recognition of our persistence, our manager awarded Raka and me the "0xdeadbeef" award at the next team meeting - an upside-down toy cow on a small plaque. Despite the cheeky name referencing the infamous bug, we appreciated the gesture after so many grueling weeks. The trophy now sits on my desk, reminding me that chasing peculiar system issues takes tenacity and creativity.

Raka being his usual self
Me at my workstation
The coveted trophy
