Crash Your Code – Lessons Learned From Debugging Things That Should Never Happen™

Let’s be honest, no one likes to see their program crash. It’s a clear sign that something is wrong with our code, and that’s a truth we don’t like to face. We try our best to avoid such a situation, and we’ve seen how compiler warnings and other static code analysis tools can help us detect and prevent possible flaws in our code, which could otherwise lead to its demise. But what if I told you that crashing your program is actually a great way to improve its overall quality? Now, this obviously sounds a bit counterintuitive; after all, we are talking about preventing our code from misbehaving, so why would we want to purposely break it?

Wandering around in an environment of ones and zeroes makes it easy to forget that reality is usually a lot less black and white. Yes, a program crash is bad — it hurts the ego, makes us look bad, and most of all, it is simply annoying. But is it really the worst that could happen? What if, say, some bad pointer handling doesn’t cause an instant segmentation fault, but instead happily introduces garbage data to the system, opening the gates wide to virtually any outcome imaginable, from minor glitches to severe security vulnerabilities? Is that really the better option? And it doesn’t have to be pointers, or any of C’s shortcomings in particular; we can end up with invalid data and unforeseen scenarios in virtually any language.

It doesn’t matter how often we hear that every piece of software is too complex to ever fully understand, or that everything that can go wrong will go wrong. We are fully aware of all the wisdom and clichés, and we completely ignore them or weasel our way around them every time we put a /* this should never happen */ comment in our code.

So today, we are going to look into our options to deal with such unanticipated situations, how we can utilize a deliberate crash to improve our code in the future, and why the average error message is mostly useless.

When Things Go Wrong

Let’s stick with a scenario where we end up with unexpected garbage data. There are many ways we could have gotten into such a situation: bad pointer handling, uninitialized variables, accessing memory outside defined boundaries, or a bad cleanup routine for outdated data — to name a few. How such a scenario ends depends, of course, on the checks we perform, but more importantly, on exactly what data we’re dealing with.

In some cases the consequences will be fairly obvious and instant, and we can look into it right away, but in the worst case, the garbage makes enough sense to remain undetected at first. Maybe we are working with valid but outdated data, or the data happens to be all zeroes and a NULL check in the right spot averts the disaster. We might even get away with it altogether. Well, that is, until the code runs in a whole different environment for the first time.

Everything is easier with an example, so let’s pretend we collect some generic data that consists of a time stamp and a value between 0 and 100 inclusive. Whenever the data’s time stamp is newer than the previous one, we shall do something with the value.


struct data {
    // data timestamp in seconds since epoch
    time_t timestamp;
    // new data value in range [0, 100]
    uint8_t value;
};

void do_something(struct data *data) {
    // make sure data isn't NULL
    if (data != NULL) {
        // make sure data is newer than the previous
        if (data->timestamp > last_timestamp) {
            // make sure value is in valid range
            if (data->value <= 100) {
                // do something with the value
                ...
            } else {
                // this should never happen [TM]
            }
            // update timestamp
            last_timestamp = data->timestamp;
        }
    }
}

This seems like a reasonable implementation: no accidental NULL dereferencing, and the logic matches the description. That should cover all the bases — and it probably does, until we end up with a pointer that leads to a bogus time stamp thousands of years from now, causing all further value processing to be skipped until then.

Oftentimes, a problem like this gets fixed by adjusting the validation check. In our example, we could include the current time and make sure that time differences stay within a certain period (a sketch of what such a check might look like follows below), and we should be fine. Until we end up in a situation where the time stamp is fine, but the value isn’t. Maybe we see a lot of outliers, so we add extra logic to filter them out, or smooth them out with some averaging algorithm.
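The validate_timestamp() helper that also appears in the assertion example further down is never defined in the original snippets, so the body below, along with the MAX_FUTURE_SKEW tolerance, is purely an assumption for illustration of how such an adjusted check might grow:


#include <stdbool.h>
#include <time.h>

// hypothetical tolerance for clock skew, in seconds
#define MAX_FUTURE_SKEW 60

// the previously processed timestamp, as used in do_something()
extern time_t last_timestamp;

bool validate_timestamp(time_t timestamp) {
    // must be newer than the data we processed before
    if (timestamp <= last_timestamp) {
        return false;
    }
    // must not lie unreasonably far in the future
    if (timestamp > time(NULL) + MAX_FUTURE_SKEW) {
        return false;
    }
    return true;
}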

As a result, the seemingly trivial task of checking that the data is newer and within a defined range has exploded in overall complexity, potentially leading to more corner cases we haven’t thought about yet and will need to deal with at a later point. Not to mention that we ignore the simple fact that we are dealing with data that shouldn’t be there in the first place. We’re essentially treating the symptoms and not the cause.

Crash Where Crashing Is Due

The thing is, by the time we can tell that our data isn’t as expected, it’s already too late. By working around the symptoms, we’re not only introducing unnecessary complexity (which we most likely have to drag along to every other place the data is passed on to), but are also covering up the real problem hiding underneath. That hidden problem won’t disappear by ignoring it, and sooner or later it will cause real consequences that force us to debug it for good. Except, by that time, we may have obscured its path so well that it takes a lot more effort to work our way back to the origin of the problem.

Worst case, we never get there, and instead, we keep on implementing workaround after workaround, spinning in circles, with the next bug just waiting to happen. We tiptoe around the issue for the sake of keeping the program running, and ignore how futile that is as a long-term solution. We might as well give up and abort right here and now — and I say, you should do exactly that.

Sure, crashing our program is no long-term solution either, but it isn’t meant to be one. It is meant as an indicator that we ended up in a situation we didn’t anticipate, and that our code is therefore not prepared to properly handle. What led us there, and whether we are dealing with an actual bug or simply flawed logic in our implementation, is a different story, and for us to find out.

Obviously, the crash itself won’t solve the problem, but it will give us a concrete starting point to look into what’s hidden underneath. We probably would have ended up in that same spot if we worked our way back from a crash happening somewhere a couple of workarounds later, but our deliberate crash early on lets us skip that and gives us a head start. In other words, spending a few minutes on a minor nuisance like implementing a proper check can save us hours of frustrating debugging down the road.

So let’s crash our code! A common way to do that is using assert() where we give an expected condition, and if that condition is ever false, the assert() call will cause the program to abort. Let’s go the extreme way and replace all conditions in our example with assertions.


void do_something(struct data *data) {
    // make sure data is not NULL
    assert(data != NULL);

    // make sure timestamp is valid and update it
    assert(validate_timestamp(data->timestamp));
    last_timestamp = data->timestamp;

    // make sure the value is in valid range
    assert(data->value <= 100);

    // do something with the value as before
    ...
}

Now, at the first sign of invalid data, the corresponding assertion will fail, and the program execution is aborted:


$ ./foo
foo: foo.c:64: do_something: Assertion `data->value <= 100' failed.
Aborted (core dumped)
$

Great, we have the crash we are looking for. There are only two problems with assertions.

Assertions Are Optional

By design, assertions are meant as a debugging tool during development, and while the libc documentation advises against it, it is common practice to disable them for a release build. But what if we don’t catch a problem during development, and it shows up in the wild one day? Chances are, that’s exactly what’s going to happen. Without the assertion code, the check that would catch the problem is never performed, and we don’t get any information about it either.
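As a quick illustration of what “optional” means here: when NDEBUG is defined, typically by passing -DNDEBUG to the compiler for a release build, <assert.h> turns every assert() into a no-op, so both the check and its message simply vanish from the binary (the function below is just a stand-in for demonstration):


// compiled with e.g. "cc -DNDEBUG foo.c", <assert.h> expands assert()
// to a no-op, so no check is performed and no message is ever printed
#include <assert.h>

void do_something_release(int value) {
    assert(value <= 100);  // gone in a release build with NDEBUG
}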

Okay, we are talking about purposely crashing our code here, so we could just make it a habit to always leave the assertions enabled, regardless of debug or release build. But that still leaves one other problem.

Assertion Messages Are Useless

If we take a look at the output from a failed assertion, we will know exactly which assert() call failed: the one that made sure the value is in valid range. So we also know that we are dealing with an invalid value. What we don’t know is the actual value that failed the assertion.

Sure, if we happen to get a core dump, and the executable contains debug information, we can use gdb to find out more about that. But unfortunately, we don’t often have that luxury outside of our own development environment, and we have to work with error logs and other debug output instead. In that case, we are left with output of very little value.

Don’t get me wrong, knowing where exactly something went wrong is definitely more helpful than no hint at all, and assertions offer great value for little effort here. And that’s the problem: if the alternative is no output at all, then yes, knowing in which line our problem occurred seems like a big win we could settle for. Considering the popularity error handling usually enjoys among programmers, it’s easy to see why we would be happy enough with that — but honestly, we shouldn’t be. Plus, it promotes bad habits for writing error messages ourselves.

Crashing Better

If you ever find yourself in a situation where you have a myriad of reports of the exact same issue, and you are lucky enough to have an error log available for each individual incident, you will learn how frustratingly helpless it feels to know that a certain condition failed, but to have zero information about what exactly made it fail. That is when you realize, and learn the hard way, how useless and almost counterproductive error messages in the form of “expected situation is not true, period” really are without any further details.

Consider the following two error messages:

  1. Assertion `data->value <= 100' failed
  2. data->value is 255, expected <= 100

In the first case, all we know is that the value is larger than 100. Since we’re dealing with an 8-bit integer, that leaves us with 155 possible options, and we might have to mentally go through every single one of them to understand what could have gone wrong, jumping from one uncertain assumption to the next, trying to find out what value could have caused all this.

However, in the second case, we can skip all that. We already know what value caused the error, shifting our debugging mindset from a generic “why did we get an invalid value?” to a concrete “how could we have ended up with 255 here?”. This gives us another head start in finding the real problem underneath.

So instead of sticking with assertions and their limited information, let’s implement our own crash function, and make it output whatever we want it to. A simple implementation using a variable argument list could look something like this:


#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

void crash(char *format, ...) {
    va_list args;

    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);

    exit(EXIT_FAILURE);
}

This way we can format our error messages the same way we do with printf(), and we can add all the information we want:


if (data->value <= 100) {
    // validation passed, handle the data
    ...
} else {
    crash("data->value is %d, expected <= 100\n", data->value);
}

Note that unlike the assertion’s output, we don’t get information on the exact location here, but that’s just for simplicity. I’ve put a more elaborate crash() function that outputs more details, including the function back trace, on GitHub in case you’re curious.
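As a rough idea of how the location details could be brought back, a small wrapper macro can prepend file, line, and function before handing things over to crash(). The CRASH() macro below is a hypothetical sketch, not the version from the GitHub link, and the ## trick for an empty argument list is a GNU extension supported by gcc and clang:


// hypothetical convenience macro: prepend the source location, then
// delegate to crash(); fmt must be a string literal so it can be
// concatenated, and ",##__VA_ARGS__" is a GNU extension
#define CRASH(fmt, ...) \
    crash("%s:%d: %s: " fmt, __FILE__, __LINE__, __func__, ##__VA_ARGS__)

// usage:
// CRASH("data->value is %d, expected <= 100\n", data->value);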

On a quick side note regarding C: Google has developed a bunch of sanitizer tools that are nowadays integrated into gcc and clang and are well worth looking into.
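For instance, both compilers accept -fsanitize flags to instrument a build with AddressSanitizer and UndefinedBehaviorSanitizer, which will catch a lot of the bad pointer handling mentioned earlier at run time:


$ gcc -g -fsanitize=address,undefined foo.c -o foo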

No Such Thing As “Too Much Information”

Keep in mind, we are focusing on problems we didn’t anticipate. Some “this should never happen” case that magically did happen. Admitting that it actually could happen, and therefore adding the proper checks for it, is a first important step. But how do we know what to put in our error message? We don’t know yet what we actually need to know, or what could have gone wrong — if we did, we wouldn’t consider it a never-to-happen scenario, but we’d try to prevent it from the beginning.

The simple answer: all of it.

Well, obviously not literally all of it, but every detail that is in the slightest way relevant and related to the situation is likely worth including in the error message. Once we add validation checks, we have all that information available anyway, so why not use it?

Take the time stamp in our data collection example: just because it was successfully validated doesn’t mean we should forget about it. It might still offer valuable debug information for a failed value validation. Who knows, maybe it reveals an issue at every full hour, or every day at 6:12:16 PM, or shows no pattern whatsoever. Either way, chances are it will help us narrow down the debug path and take us yet another step closer to the actual problem.
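For example, the crash() call from before could simply carry the already-validated time stamp along, something along these lines:


crash("data->value is %d, expected <= 100 (timestamp %lld)\n",
      data->value, (long long)data->timestamp);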

And even if it doesn’t, and the extra information turns out to be completely irrelevant, we can always filter it out or ignore it. However, what we can’t do is add it after a crash. So don’t be shy about adding as much information as possible to your error messages.

Choose Your Battles

Of course, not every unexpected situation or invalid data scenario necessarily calls for a crash. You probably wouldn’t want to abort the whole program when validating random user input, or if the remote server you request data from is unreachable, or pretty much any case that deals with data out of your direct control. But on the other hand, those situations aren’t fully unexpected either, so having a default fallback workaround in place, or outputting an error without the crash, is a valid way to deal with that.
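A non-fatal way of reporting such a problem might look something like the following sketch, where the specific command handling is made up for illustration:


#include <stdio.h>
#include <string.h>

// user input is outside our control, so report the problem
// and return an error instead of aborting the whole program
int handle_command(const char *command) {
    if (command == NULL || strcmp(command, "foo") != 0) {
        fprintf(stderr, "invalid parameter '%s' for command foo\n",
                command == NULL ? "(null)" : command);
        return -1;  // let the caller decide how to proceed
    }
    // known command, carry on as usual
    return 0;
}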

Nevertheless, making it a habit to provide meaningful information with as much detail as possible can help everyone involved understand the problem better. To give a few examples:

  • Parsing input failed vs
    Invalid parameter abc for command foo
  • Error loading data vs
    Connection timeout while requesting data from server xyz
  • Assertion `data->value <= 100' failed vs
    [1547420380] data->value (0x564d379681fc) was 234

As programmers, we grow up being indoctrinated on the importance of error handling, but in our early years, we rarely learn how to properly utilize it, and we might fail to see any actual benefit or even use for it at all. As time goes by, error handling (among other things like code documentation and testing) often becomes this annoyance that we just have to deal with in order to keep others happy: teachers, supervisors, or that one extra-pedantic teammate. We essentially end up doing it “for them”, and we easily overlook that we ourselves are actually the ones who can benefit the most from it.

Unfortunately, no storytelling can substitute for learning that the hard way, but hopefully this still offers some food for thought and a new perspective on the subject.

In that sense: Happy Crashing!

(Banner image from the long-lost Crash Bansai gallery of deformed toy automobiles for morbid miniature gardening.)


