Catastrophic Failure

your system is larger than you think.

In a past life, I was a software verification researcher, and I attended the NASA Formal Methods conference in 2014. The opening speaker at NFM gave a talk full of anecdotes about different kinds of failure, from “an implicit double-to-float cast wasted an entire mission” to catastrophic failure, in which someone dies. Catastrophic failure at NASA is big and obvious and makes the news: if the rocket goes wrong, then everyone onboard dies in a big fireball a mile above the ground.

They then moved on to work they had done for the US Postal Service, who had been experiencing thefts by counter staff and wanted a new point-of-sale system to combat this. This involved finding a set of constraints for operating the cash register such that staff could do their job while being unable to take anything extra. The speaker laid out the constraints, and offered a $100 Postal Order to the person who could find the Catastrophic Failure.

audience: could it become locked out and inoperable?
speaker: no.
audience: could they take money by doing …?
speaker: no.

This went on for a while, as we tried to find deadlocks or vulnerabilities in the system. Eventually,

audience: what if someone came in with a gun and demanded the money?
speaker: bingo.

USPS decided that stemming theft was not worth risking the death of an employee.

What this story tells us is that software has consequences. It’s easy to look at a missile guidance system or high-frequency trading and say “that’s unethical!”, but far more mundane software performing far more mundane tasks can also have dangerous or even lethal failure modes.

For example, banks are notoriously bad at updating names, and deadnames can resurface at inopportune moments that risk outing the user to housemates. Parental spyware will out a kid to their parents, risking homelessness or suicide.

As engineers, we must keep the whole system in mind, including its users and their wider lives and situations. We must respond to our products’ worst failure modes, no matter how unlikely we believe them to be. You cannot roll back a corpse.

And if that means a product or feature does not launch, then so be it.