Context-switching gone wrong
On this blog, I write about bugs in low-level systems code that were hard to find, meaning they took me hours, days, or sometimes even weeks to track down. I hope you will find some joy in my suffering:
Although this bug wasn’t that hard to find (or fix), I remember it as having some “fantastic” potential: when it first appeared, nothing made sense anymore…
Some bugs are hard to find because, when they manifest, they do so in a very misleading way. This one is about the fairly unexpected consequences of things not being properly aligned. Most of the time, alignment problems “just” lead to bad performance. In other cases, the hardware architecture may generate a fault on a misaligned access (but this is easy to pinpoint and fix). In our case, however, the problem was a little more subtle.
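To make the terminology concrete, here is a minimal C sketch of what “properly aligned” means for a pointer. It is not code from the kernel in question; the helper names `is_aligned` and `load_u32_unaligned` are made up for illustration, and the sketch assumes the alignment is a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if ptr is a multiple of align.
 * Assumes align is a non-zero power of two, so the low bits of an
 * aligned address must all be zero. */
static bool is_aligned(const void *ptr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)ptr & (align - 1)) == 0;
}

/* Hypothetical helper: read a 32-bit value from an address that may be
 * misaligned. Copying byte-wise with memcpy avoids dereferencing a
 * misaligned uint32_t pointer, which is undefined behavior in C and can
 * be slow or fault on some architectures. */
static uint32_t load_u32_unaligned(const void *ptr)
{
    uint32_t value;
    memcpy(&value, ptr, sizeof value);
    return value;
}
```

A check like `is_aligned(p, 4)` is cheap enough to put into an assertion, which is one way to turn a silent alignment problem into a loud one.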
It was a particularly mischievous bug. At the time, we were writing a new research prototype kernel in our group. Research operating systems are fun because you typically end up writing a lot of code before you get something very basic (like print) running. In fact, you often end up with a system call or function that can’t be fully implemented yet (because it would require too much code, or because other pieces it depends on are still missing). Therefore, every so often, you have to cut some corners and write a temporary, partial implementation to make progress toward a goal.