matto comments on How factories were made safe

matto 4 Oct 2021 4:13 UTC
10 points
It seems to me that a similar shift in thinking is taking place in the software industry under the label of “SRE.”

(Disclaimer: I’m a site reliability engineer)

You write:

Thus, accidents were typically attributed to worker “carelessness.” (...) Since the contribution of the work environment to safety was not yet acknowledged, accidents were seen as largely unavoidable. (...)

Replace the word “accidents” with “bugs” and you have a pretty accurate picture of software development up until about 2000, or so the greybeards and some of my history readings tell me. Basically, each aspect of writing a program was highly individualized, with developers collecting cool tricks like efficient algorithms or highly reusable functions. Some of this was passed on person to person, but a lot of it simply vanished when someone changed jobs.

Then things started to get better, with practices like automated testing becoming mainstream, until we started building applications that ran on distributed systems. The practices of the time couldn’t handle that level of complexity well enough to provide a reliable service. This was at a time when services like Google Maps or email were becoming ingrained in people’s daily lives. So we began talking about “devops,” a way for developers (who built applications) and operations (who ran the applications) to collaborate. To cut a long story short, at some point the SRE paradigm was born, which I consider an implementation of the ideas that devops generated. SRE itself is very similar to this:

The new corps of safety engineers began fixing the root causes of injuries. They put guards and enclosures around the moving parts of machines. They invented locks for the machines, so that they couldn’t be turned on unless those guards were in place. They created automatic shutoff mechanisms.

My day-to-day involves tracking down (root-causing) outages, improving observability (ensuring good signal about system health), and automating everything from deployments to developer tooling in order to decrease the risk of a breaking change making its way into production and taking the service down. The key difference is that my work doesn’t save any lives, but I like to think that it makes the lives of my users and colleagues better.

Also, this rings extremely true:

The new corps of safety engineers began fixing the root causes of injuries. They put guards and enclosures around the moving parts of machines. They invented locks for the machines, so that they couldn’t be turned on unless those guards were in place. They created automatic shutoff mechanisms.

I’ve gotten a good deal of pushback from experienced developers who consider the whole devops/SRE thing a passing fad. They also think that minutes or hours of downtime are part of the game, that there’s simply no way around bugs, and that we’re all doomed to mitigate outages at 3am on Sundays.