There’s an awkward valley between “reasonably reliable, but with a major outage every few years in a storm or something” and “completely reliable, you can trust it with your life,” where the system is reliable enough that we stop thinking of it as something that might go away, but not so reliable that we should.
Apropos of my other comment on SRE/complex-system-failure applications to writing/math, this is a known practice: if a service has been too reliable for a while and has not used up its promised ‘error budget’, it will be deliberately taken down to make sure the promised number of errors actually happens. From ch. 4 of the Google SRE book:
An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound. For example, we might decide that we will return Shakespeare search results “quickly,” adopting an SLO that our average search request latency should be less than 100 milliseconds...

...Choosing and publishing SLOs to users sets expectations about how a service will perform. This strategy can reduce unfounded complaints to service owners about, for example, the service being slow. Without an explicit SLO, users often develop their own beliefs about desired performance, which may be unrelated to the beliefs held by the people designing and operating the service. This dynamic can lead to both over-reliance on the service, when users incorrectly believe that a service will be more available than it actually is (as happened with Chubby: see “The Global Chubby Planned Outage”), and under-reliance, when prospective users believe a system is flakier and less reliable than it actually is.
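To make the SLI/SLO structure concrete, here is a minimal sketch of checking the book’s example latency SLO; the function and constant names are my own illustrations, not anything from the SRE book:

```python
# Minimal sketch of the example SLO above: the SLI is the measured average
# search latency, and the SLO has the structure "SLI <= target".
from statistics import mean

SLO_AVG_LATENCY_MS = 100  # target from the book's Shakespeare example

def slo_met(request_latencies_ms: list[float]) -> bool:
    """True if the SLI (average latency) meets the SLO target."""
    sli = mean(request_latencies_ms)  # the SLI: measured average latency
    return sli <= SLO_AVG_LATENCY_MS

# slo_met([42.0, 87.5, 120.3]) -> True, since the average is ~83 ms
```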
“The Global Chubby Planned Outage”
[Written by Marc Alvidrez]
Chubby [Bur06] is Google’s lock service for loosely coupled distributed systems. In the global case, we distribute Chubby instances such that each replica is in a different geographical region. Over time, we found that the failures of the global instance of Chubby consistently generated service outages, many of which were visible to end users. As it turns out, true global Chubby outages are so infrequent that service owners began to add dependencies to Chubby assuming that it would never go down. Its high reliability provided a false sense of security because the services could not function appropriately when Chubby was unavailable, however rarely that occurred.
The solution to this Chubby scenario is interesting: SRE makes sure that global Chubby meets, but does not significantly exceed, its service level objective. In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system. In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added. Doing so forces service owners to reckon with the reality of distributed systems sooner rather than later.
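As a rough sketch of that decision rule (with made-up numbers and names; nothing here is Chubby’s actual mechanism): treat the availability target as implying a quarterly downtime budget, and synthesize whatever outage time real failures haven’t already consumed.

```python
# Hedged sketch of the quarterly rule described above. The 99.99% target
# and all names are assumptions for illustration, not Chubby's real config.
QUARTER_SECONDS = 91 * 24 * 3600    # ~one quarter of wall-clock time
AVAILABILITY_TARGET = 0.9999        # hypothetical SLO: 99.99% availability

def planned_downtime_needed(observed_downtime_s: float) -> float:
    """Seconds of controlled outage to synthesize so the service meets,
    but does not significantly exceed, its availability target."""
    error_budget_s = QUARTER_SECONDS * (1 - AVAILABILITY_TARGET)  # ~786 s
    return max(0.0, error_budget_s - observed_downtime_s)

# If true failures caused only 120 s of downtime this quarter, the sketch
# suggests ~666 s of deliberate outage to spend the remaining budget.
```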
...Don’t overachieve:
Users build on the reality of what you offer, rather than what you say you’ll supply, particularly for infrastructure services. If your service’s actual performance is much better than its stated SLO, users will come to rely on its current performance. You can avoid over-dependence by deliberately taking the system offline occasionally (Google’s Chubby service introduced planned outages in response to being overly available), throttling some requests, or designing the system so that it isn’t faster under light loads.
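The “throttling some requests” option could look something like the following sketch; the thresholds and rates are assumptions I made up for illustration:

```python
# Hedged sketch of throttling to avoid overachieving an availability SLO:
# once measured availability exceeds the published target by a margin,
# reject a small fraction of requests so users cannot come to depend on
# better-than-promised service. All constants are illustrative.
import random

SLO_AVAILABILITY = 0.999      # published availability target
OVERACHIEVE_MARGIN = 0.0005   # start throttling above 99.95%
THROTTLE_FRACTION = 0.001     # reject 0.1% of requests while overachieving

def should_reject(measured_availability: float) -> bool:
    overachieving = measured_availability > SLO_AVAILABILITY + OVERACHIEVE_MARGIN
    return overachieving and random.random() < THROTTLE_FRACTION
```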
Thanks! Chubby planned outages were in fact one of the things I was thinking about in writing this, but I hadn’t known that they were public outside Google.
(Quite a lot is public outside Google, I’ve found. It’s not necessarily easy to find, but whenever I talk to Googlers or visit, I learn less that’s new to me than I expected. Only a few things I’ve been told have genuinely surprised me, and even those I had half-suspected anyway. Google’s transparency is considerably underrated.)