Failure Rates and Management
You have a test for every method in your code, and you have refactored it into short, sweet methods that follow the Law of Demeter.
In short, you don’t just believe that your code is free from failure, you have proof that it is: your tests are all green and DRY.
It has been running in production for weeks, months, years now, and no exception, no unexpected behaviour is in sight.
Theory and Practice
Today is, God willing, the space shuttle’s last mission—STS-135—before the whole program is retired.
At the beginning of the program, as NASA embarked on the ambitious goal of building and operating a reusable spacecraft, management estimated a catastrophic failure rate of 1:100 000, or roughly one failure in 300 years of daily shuttle launches, based on historical data gathered from previous manned and unmanned missions.
Engineers estimated a catastrophic failure rate of 1:100, also based on historical data.
The actual failure rate is 1:67.5 (135 missions, 2 catastrophic failures).
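The arithmetic behind these three numbers is easy to check. The following is a minimal sketch; the function names are illustrative, and the 365 launches per year simply restate the "daily launches" assumption above.

```python
def years_between_failures(one_in, launches_per_year=365):
    """Expected years per failure when one launch in `one_in` fails."""
    return one_in / launches_per_year


def observed_one_in(missions, failures):
    """Observed failure rate, expressed as 'one in N missions'."""
    return missions / failures


management = years_between_failures(100_000)  # management's 1:100 000 estimate
engineers = years_between_failures(100)       # the engineers' 1:100 estimate
actual = observed_one_in(missions=135, failures=2)

print(f"management: one failure every {management:.0f} years of daily launches")
print(f"engineers:  one failure every {engineers:.2f} years of daily launches")
print(f"actual:     one failure in {actual} missions")
```

Management's estimate works out to about 274 years of daily launches, which is the "300 years" quoted above in round numbers. The observed rate is about 1,500 times worse than that estimate, and still somewhat worse than the engineers' 1:100.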
There is a very simple rule in stock markets: past performance is no indication of future performance. Too many things can change, both internally and externally, for a reliable prediction of the future to be possible.
After all, all predictions are based on historical data!
Thus, you should consider whether you have covered all the edge cases, all the possible failure scenarios you can encounter, and prepare for them.
This discipline is called risk management, and was established at NASA's JPL after the Challenger disaster. It operates at the borders of statistics, physics and philosophy: statistical data about the past, the laws of physics and their predictions for the here and now, and philosophy (especially abstract logic, which is very closely related to mathematics) for the future.
A lot of businesses, particularly large ones, have established processes to judge and manage risk, but if you are working in a small team, such a process can be a burden.
However, you can deal with risk:
Identify
- What parameters can change that influence your software? Those are OS upgrades, compiler/interpreter/VM changes, hardware changes.
- Which elements of your whole stack are volatile (i.e. changing)? Those are hard disk space, memory usage under load, network connections.
- Which of these elements can we influence? Multiple data centers, fail-over architecture, redundant servers are all possible options.
- What can we not influence? Power outages, burglary, cloud provider failure, essentially everything where you buy expertise instead of having (or developing) it yourself.
Triage
- What is critical for success?
- What is low-hanging fruit?
- What is critical, but has low chances of failure?
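One way to make the triage questions concrete is to score each risk you identified. The sketch below is an assumption, not a method prescribed here: the 1 to 5 scales, the thresholds, the `Risk` class and the example risks are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # 1 (rare) to 5 (almost certain); hypothetical scale
    impact: int      # 1 (annoyance) to 5 (business-ending); hypothetical scale
    effort: int      # 1 (an afternoon) to 5 (months of work); hypothetical scale

    @property
    def exposure(self):
        # Simple likelihood-times-impact score; an assumed heuristic.
        return self.likelihood * self.impact


def triage(risks):
    """Sort risks into buckets matching the three triage questions."""
    critical = [r for r in risks if r.impact >= 4]
    return {
        "critical": sorted(critical, key=lambda r: r.exposure, reverse=True),
        "low_hanging": [r for r in risks if r.effort <= 2 and r.exposure >= 6],
        "critical_but_unlikely": [r for r in critical if r.likelihood <= 2],
    }


risks = [
    Risk("disk fills up under load", likelihood=4, impact=4, effort=1),
    Risk("data center power outage", likelihood=1, impact=5, effort=4),
    Risk("OS upgrade breaks a dependency", likelihood=3, impact=2, effort=2),
]

buckets = triage(risks)
for bucket, items in buckets.items():
    print(bucket, [r.name for r in items])
```

The numbers matter less than the conversation they force: a risk that lands in more than one bucket, like the full disk here, is a good place to start.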
Action
Throw the proper resources at the problem. Money solves infrastructure issues, time (which means salaries, which mean money!) solves code problems, and insurance handles force majeure by limiting the damage.
Given Enough Eyeballs, All Bugs Are Shallow
Shared code ownership, or shared responsibility for code, also means more eyes looking at the code, and thus an increased likelihood of finding (and removing!) edge cases. Tests ensure that changes to the code don’t create regressions. Add tests even after you have developed the feature. Integration testing exercises your code base in concert, uncovering side effects of your code.
Measure Once, Measure Twice, Measure Always
Instrument your code, and add logging and monitoring. Gather data. Even though historical data doesn’t indicate future performance, it does let you identify problems before they become failures.
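Instrumentation does not have to start with a full monitoring stack. As a minimal sketch using only Python's standard library, a decorator can log how long each call takes and whether it failed; the `instrumented` decorator, the threshold and `handle_request` are hypothetical names chosen for this example.

```python
import functools
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("metrics")


def instrumented(slow_after_seconds=0.5):
    """Log the duration of every call; warn when slow, log a traceback on failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
            except Exception:
                elapsed = time.perf_counter() - start
                log.exception("%s failed after %.3fs", func.__name__, elapsed)
                raise  # never swallow the failure, only record it
            elapsed = time.perf_counter() - start
            level = logging.WARNING if elapsed > slow_after_seconds else logging.INFO
            log.log(level, "%s took %.3fs", func.__name__, elapsed)
            return result
        return wrapper
    return decorator


@instrumented(slow_after_seconds=0.1)
def handle_request(payload):
    # Hypothetical handler standing in for real application code.
    return {"echo": payload}


handle_request("ping")
```

Calls that creep toward the threshold show up as warnings in the log long before they turn into timeouts, which is exactly the early signal you are gathering data for.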
Live on the Edge
Find your edge cases, and test for them. Take your code where it was never intended to go, hire pen-testers to beef up your security, and fail constantly.
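As a small illustration of testing at the edges, take a function that parses a port number from user input. `parse_port` and `rejects` are hypothetical helpers written for this sketch; the point is the list of inputs, which covers the boundaries on both sides and a few values the code was never meant to see.

```python
def parse_port(value):
    """Parse a TCP port from user input; hypothetical example function."""
    text = str(value).strip()
    # isdigit() alone accepts non-ASCII digits, so require ASCII as well.
    if not text.isascii() or not text.isdigit():
        raise ValueError(f"not a port number: {value!r}")
    port = int(text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def rejects(value):
    """Return True when parse_port refuses the input."""
    try:
        parse_port(value)
    except ValueError:
        return True
    return False


# The happy path, then the boundaries on both sides.
assert parse_port("8080") == 8080
assert parse_port(" 1 ") == 1
assert parse_port("65535") == 65535
assert rejects("0")
assert rejects("65536")

# Inputs the code was never intended to see.
hostile_inputs = ["", "-1", "80; rm -rf /", "\uff18\uff10", "1e3", None, "9" * 100]
for hostile in hostile_inputs:
    assert rejects(hostile), repr(hostile)
```

The full-width digits (`"\uff18\uff10"`) are the kind of case that only turns up when you go looking: without the ASCII check, Python's `int()` would happily parse them as 80.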
Loose Coupling
Accept Risk, Prepare for Failure
Sooner or later, something will go wrong. Have documentation in place to get up to speed again as soon as you can. Make yourself redundant, and document what you do. Keep documentation up to date.
Be Dedicated
Even though you aren’t building the space shuttle, learn from NASA, and take their hard-earned, costly lessons to heart. Space exploration is awesome, and there is no reason we shouldn’t learn everything we can from it.