The critical role of systems thinking in software development

Software applications exist to serve practical human needs, but they inevitably accumulate undefined and defective behaviors as well.

Because software flaws are often left undiscovered until some specific failure forces them to the surface, every software project ships with some degree of unquantified risk. This is true even when software is built by highly skilled developers, and is an essential characteristic of any complex system.

When you really think about it, a software system is little more than a formal mathematical model with incomplete, informally specified inputs, outputs, and side effects, run blindly by machines at incomprehensibly high speeds. And because of that, it’s no surprise that our field is a bit of a mess.

This chaotic environment becomes more comprehensible when you think of software not as rules rigidly defined in code, but as a living system with complex emergent behavior. Where programmers and people using an application see a ‘bug’, a systems theorist would see just another lever to pull that produces some sort of observable outcome. In order to develop a better mental model for the systems we build, we’ll need to learn how to think that way, too.

But instead of getting bogged down in theory, let’s work through a quick example of what complex emergent system behavior looks like in a typical web application.

A brief story on how small flaws can compound to create big problems

Suppose that you are maintaining a knowledge base application… a big collection of customer support articles with a content management system for employees to use. It’s nothing fancy, but it does its job well.

The knowledge base website is low-traffic, but it’s important to the company you work for. To quickly identify failures, it has a basic exception monitoring system set up which sends an email to the development team every time a system error happens.

This monitoring tool only took a few minutes to set up, and for years you haven’t even had to think about its presence except when alert emails are delivered—and that doesn’t happen often.

But one day, you arrive at work and find yourself in the middle of a minor emergency. Your inbox is stuffed with over 1300 email alerts that were kicked up by the exception reporting system in the wee hours of the morning. With timestamps all within minutes of each other, it is pretty clear that the this problem was caused by some sort of bot. You dig in to find out what went wrong.

The email alerts reveal that the bot’s behavior resembled that of a web crawler: it was attempting to visit every page on your site, by incrementing an id field. However, it had built the request URLs in a weird way; constructing a route that no human would ever think to come up with.

When the bot hit this route, the server should have responded with a 404 error to let it know that the page it requested didn’t exist. This probably would have convinced the bot to go away, but even if it hadn’t, it would at least prevent unhandled exceptions from being raised.

For nearly any invalid URL imaginable, this is the response the server would have provided. But the exact route that the bot was hitting just so happened to run some code that, due to a flawed implementation, raised an exception rather than failing gracefully.

The immediate solution to this problem is straightforward: Fix the defective code so that an exception is no longer raised, add a test that probably should have been there in the first place, then finally temporarily disable the exception reporter emails until you can find a new tool that will not flood your inbox in the event of a recurring failure.

If this were really your project, you’d probably take care of those two chores right away (treating it as the emergency it is), but then might be left wondering about the deeper implications of the failure, and how it might relate to other problems with your system that have not been discovered yet.

Failure is almost never obvious until you’re looking in the rearview mirror

If the scenario from the story above seemed oddly specific, it’s because I’ve dealt with it myself a few years ago. The context was slightly different, but the core problems were the same.

In retrospect, it’s easy to see the software flaw described above for what it was: an exposed lever that an anonymous visitor could pull to deliver an unlimited amount of exception report emails.

But at the time, the issue was baffling to me. In order for this problem to occur, a bot needed to trigger a very specific corner case by passing an invalid URL that could never be imagined in testing. The code handling this request would have handled this failure case properly when it was originally written, but at some point the query object we were using was wrapped in a decorator object. That object introduced a slight behavior change, which indirectly lead to exceptions being raised.

The behavior change would not be obvious on a quick review of the code; you’d need to read the (sparse) documentation of a third party library that was in theory meant to provide a fully backwards-compatible API. Some extra tests could have potentially caught this issue, but the need for such tests was only obvious in hindsight.

A real person using the website would never encounter this error. A developer manually testing the site would never encounter this error. Only a bot, doing bad things by pure coincidence, managed to trigger it. In doing that, it triggered a flood of emails, which in turn put a shared resource at risk.

I’m embarrassed to admit that the real scenario was a bit worse, too. The email delivery mechanism I was using for sending out exception reports was the same mechanism used for sending out emails to customers. Had the bot not just “given up” eventually, it would have likely caused a service interruption on that side of things as well.

This is the swamp upon which our castles are built. Hindsight is 20/20, but I’m sure you can come up a similarly painful story from your own work if you look back far enough in your career.

Accepting software development as an inherently hazardous line of work

I wish that I could give a more confident answer for how you can avoid these sorts of problems in your own work, but the truth is that I am still figuring out all of that myself.

One thing I’d like to see (both in my own work and in the general practices of software developers) is a broadened awareness of where the real boundaries are in the typical software application, and what can go wrong at the outer reaches of a system.

Code reviews are now a fairly common practice, and that is a good thing, but we need to go far beyond the code to effectively reason about the systems we build.

In particular, it’d help if we always kept a close eye on whatever shared resources are in use within a system: storage mechanisms, processing capacity, work queues, databases, external services, libraries, user interfaces, etc. These tools form a “hidden dependency web” below the level of our application code that can propagate side effects and failures between seemingly unrelated parts of a system, and so they deserve extra attention in reviews.

It’s also important to read and write about experiences with failures (and near-misses) so that we gain a shared sense of the risks involved in our work and how to mitigate them.

Many system-level problems are obvious in hindsight but invisible at the time that they’re introduced; especially when a particular failure requires many things to go wrong all at once for the negative effects to happen.

Finally, we are not the only field to deal with developing and operating complex systems, so we should also be looking at what we can learn from other disciplines.

Richard Cook’s excellent overview of How Complex Systems Fail is one example of ideas that originated in the medical field which apply equally well to software development, and I strongly recommend reading it as source of inspiration.

One last thought…

When software literally shipped on ships—destined to run on a particular set of known devices and solve a well-defined, static purpose—the programmer’s role was easier to define. Now that everything is connected to everything else, and the entire economy depends on the things we build, we have more work to do if we want to make our systems both safe and successful in the modern world.

Although it overwhelms me as much as anyone else, I’m up to the challenge of writing code for the year we’re living in. For now, that means dealing with extreme complexity and a lack of predictability at the boundary lines of our systems. The example I gave in this article is at the shallow end of that spectrum, but even it is not obvious without some careful practice.

If you haven’t already, I hope you’ll join me in going beyond raw coding skills, and begin studying and practicing systems thinking in your daily work.

Editor’s note: Gregory Brown’s book about the non-code aspects of software development, called “Programming Beyond Practices,” will be published soon by O’Reilly. Follow its progress here.