If you are in the business of building pacemakers, rocket ships, or banking software, I hope you have an extremely low tolerance for failure states. For anyone building anything else, a higher tolerance for failure states offers a dramatic boost to velocity.

How we deal with failure has far-reaching effects: it can create a vicious cycle that reinforces defensiveness and risk aversion, or a virtuous cycle of testing assumptions and experimenting.

Most teams fall into one of two strategies:

  1. Failure is not an option.
  2. Failures are easily reversible.

For option 1, imagine an old legacy application with lots of dependencies and high uptime requirements. Test coverage is minimal, and the CI/CD pipeline requires a lot of manual intervention. Deployments can take hours. For the developers working on the system, there is little incentive to push changes. Every change is a painful risk of downtime, so only the most obvious, low-impact changes are made. It’s a vicious cycle in which performance and usability degrade over time. Worse still, the development team knows which changes need to be made, but risk aversion prevents them from being implemented.

For option 2, imagine a well-tested, modular application with minimal dependencies and an expectation of iteration. Deployments are fast and easily reverted if unintended behavior is detected. The development team deploys on a regular cadence, backed by solid test coverage. These conditions create a virtuous cycle where new features can be prototyped quickly and failure states are minimally disruptive. The incentive is to keep building instead of avoiding failure.
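As a rough illustration of what "easily reverted" can look like, here is a minimal sketch of a deploy step that rolls back automatically when a post-deploy health check fails. The `release.sh` script, version strings, and health-check URL are hypothetical stand-ins for whatever your pipeline actually uses.

```python
import subprocess
import sys
import urllib.request

# Hypothetical health-check endpoint; substitute your own.
HEALTH_CHECK_URL = "https://example.com/health"


def healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the service answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def deploy(new_version: str, previous_version: str) -> None:
    """Release new_version, rolling back if it fails the health check."""
    # `release.sh` is a placeholder for your actual deploy command.
    subprocess.run(["./release.sh", new_version], check=True)
    if not healthy(HEALTH_CHECK_URL):
        # Reverting is the same cheap operation as releasing,
        # so a bad deploy costs minutes rather than hours.
        subprocess.run(["./release.sh", previous_version], check=True)
        sys.exit(f"{new_version} failed its health check; rolled back to {previous_version}")


if __name__ == "__main__":
    deploy("v1.4.2", "v1.4.1")
```

The important property isn’t the specific tooling; it’s that the rollback path is as fast and as automated as the release path.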

Most data teams, despite their best intentions, fall into option 1. Unfortunately, the only way to eliminate software failures is to never change anything. And since data teams are constantly implementing new requirements, making failure less catastrophic is our best chance to maintain velocity.
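For data teams specifically, one common way to make failures cheap to reverse is a build-then-swap pattern: build the new version of a table next to the live one, validate it, and only then rename it into place. The sketch below is illustrative, assuming SQLite-flavored SQL, an existing live `orders` table, and a placeholder validation rule.

```python
import sqlite3


def publish_orders(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Rebuild the `orders` table next to the live copy, then swap it in.

    If the build or validation fails, the live table is never touched and the
    half-built copy is dropped; if the new table turns out to be wrong after
    the swap, renaming `orders_prev` back is a one-line revert.
    """
    cur = conn.cursor()
    try:
        cur.execute("DROP TABLE IF EXISTS orders_next")
        cur.execute("CREATE TABLE orders_next (id INTEGER, amount REAL)")
        cur.executemany("INSERT INTO orders_next VALUES (?, ?)", rows)
        # Illustrative sanity check before anything is exposed to consumers.
        count = cur.execute("SELECT COUNT(*) FROM orders_next").fetchone()[0]
        if count == 0:
            raise ValueError("new build produced no rows")
    except Exception:
        # Clean up the half-built copy; the live table is untouched.
        cur.execute("DROP TABLE IF EXISTS orders_next")
        conn.commit()
        raise
    # Swap only after validation; keep the old copy around for a cheap revert.
    cur.execute("DROP TABLE IF EXISTS orders_prev")
    cur.execute("ALTER TABLE orders RENAME TO orders_prev")
    cur.execute("ALTER TABLE orders_next RENAME TO orders")
    conn.commit()
```

The same shape works with views, partition swaps, or blue-green schemas in a warehouse: the expensive part (the build) is decoupled from the risky part (the swap), and the swap is trivially reversible.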