An Ontology of Systemic Failures: Dragons, Bullshit Mountain, and the Cloud of Doom
Core Claim
I assert that a lot of value can be gained by sorting systemic failures into four broad categories.
For the sake of pithiness, I will name them “bugs”, “dragons”, “bullshit mountain”, and “the cloud of doom”.
A Bug is the simplest kind of failure: you have a single cause, and a single symptom. Fixing a bug is pretty easy, compared to other modes—you just fix the cause, and the symptom goes away. Bugs aren’t even really systemic failures, since they don’t involve any cross-talk between multiple causes or effects, but they are included here for completeness.
A Dragon is like a Bug, except that instead of a single symptom, there are multiple seemingly independent symptoms. Because the symptoms seem large and diverse, a Dragon will often seem far more daunting than it actually is—although most Dragons are still pretty large and significant problems. Dealing with a Dragon simply involves some particularly Heroic type identifying that there's a Dragon, hunting it down, and slaying it.
Bullshit Mountain is basically the opposite of the Dragon. There is one huge, unbearable, painful symptom that everyone knows about and everyone wishes would just GO AWAY. But no one can get any traction on it. This is because that symptom is actually being contributed to by a thousand little causes, all of which only contribute a little—so making progress on any one of them feels like it doesn't help much, if at all. The only way to solve Bullshit Mountain is for everyone in the org to roll up their sleeves, get a shovel, pick some little corner of Bullshit Mountain to work on, and start shoveling—and not stop until the problem gets noticeably better, even if their work doesn't seem to be doing much to contribute to the improvement.
Finally, we have the Cloud of Doom. The Cloud of Doom is where you have a thousand tiny causes, each of which meaningfully contribute to each of a thousand little symptoms, which together make the whole system feel unworkable. The only way to fix the Cloud of Doom is to pour an ungodly amount of Slack into the system, and hope the cloud shakes loose and blows away—otherwise, everyone needs to just throw up their hands and abandon the whole mess.
A good question to ask yourself, when trying to tackle a seemingly insurmountable mess of problems, is: is this a Dragon, Bullshit Mountain, or a Cloud of Doom?
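The four categories can be read as a 2×2 over the multiplicity of causes and symptoms. Here's a minimal sketch of that reading in Python; the function name and the one-vs-many cutoff are my own simplification, not something from the post, and the real diagnostic skill is judging whether causes and symptoms are individually tractable, not literally counting them:

```python
def classify_failure(num_causes: int, num_symptoms: int) -> str:
    """Map a problem's cause/symptom multiplicity onto the four categories.

    Treats "more than one" as "many"—a crude stand-in for the post's
    distinction between individually tractable and diffuse causes.
    """
    many_causes = num_causes > 1
    many_symptoms = num_symptoms > 1
    if not many_causes and not many_symptoms:
        return "Bug"                 # one cause, one symptom
    if not many_causes:
        return "Dragon"              # one cause, many seemingly independent symptoms
    if not many_symptoms:
        return "Bullshit Mountain"   # many small causes, one big symptom
    return "Cloud of Doom"           # many causes feeding many symptoms
```

The point of the sketch is just that Bullshit Mountain and the Cloud of Doom share the "many causes" row, which is why the post spends most of its effort distinguishing them.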
Slaying Dragons
If the problem is a Dragon, are you the one qualified to slay it? If you are, what further information or resources do you need? Do you need a team? Are you the right person to lead that team?
Because Dragons are single-cause problems, they respond well to a single person and a single plan. Most organizations hope (and therefore pretend) that most of their problems are Dragons, and most organizational problem-solving is dedicated to finding Dragons (or making problems look like Dragons) and then paying a few core Heroes big bucks to slay them.
Basically, slaying Dragons is a Solved Problem in the organizational world; most apparent lack of success involves mis-identifying Bullshit Mountains and Clouds of Doom as Dragons, in the naive hope that they’ll turn out to be Dragons anyways and therefore can be solved by a process that the system knows how to implement.
Shoveling Bullshit Mountain
If your problem is Bullshit Mountain, do you have enough buy-in to get enough people to start shoveling? Are you the right cheerleader to keep people motivated? Does the team still care enough to even want to solve the problem? These are way harder problems to tackle than the Dragon problems listed above, so expect a lot of people to handwave and try to convince you that the problem is a Dragon (if they want to signal buy-in for solving it) or a Cloud of Doom (if they don't).
Dealing with Bullshit Mountain as if it were a series of Dragons is what causes "death marches" in the tech industry. It's what makes transformative "business culture" experiments seem to work temporarily (by clearing out the mountain and replacing it with a new system, which then begins accumulating its own Bullshit). It's the biggest contributor to low employee morale that a mid- to large-size "healthy" organization can have. (In fact, the transition from Bullshit Mountain into a Cloud of Doom is probably the tipping point for an organization becoming unsalvageable.)
Surviving The Cloud of Doom
Finally, we have the Cloud of Doom. Every organization has one. Some are big, some are small, some are more toxic than others. But an organization with 0% of its problems in a Cloud of Doom is an organization that has not yet had to do anything actually real.
So, you live with it. And the organization begins to develop other problems—mostly bugs, but some Dragons and a few Bullshit Mountains here and there. And as the Dragons get bigger and lay waste to more countryside, and as the Bullshit Mountains tower higher and higher overhead, they start to interweave and correspond, feeding the Cloud of Doom.
Eventually, the Cloud of Doom begins to actually choke the life out of your organization. That’s when you have a single choice: inject lots and lots of Slack, or leave.
Injecting lots and lots of Slack basically means “doing less with more”, which almost no one believes is the correct choice. But if you aren’t burning everything down and starting over somewhere else, it’s the only choice. If you can’t live with your Cloud of Doom, and you won’t flee it, you’re going to have to stop feeding it and let it blow away.
Harnessing the Cloud of Doom
One thing you CAN do, if you’re particularly vicious, is convince people that your Cloud of Doom is actually just Bullshit Mountain, and use that to extract work from your subordinates. You need to be exceptionally clever (in a Raoian sociopathic sense) in order to pull this off, because you basically need to manage both sides of the information and effort flow: you have to keep everyone believing that there’s a Bullshit Mountain that they’re biting into, AND you have to re-direct and manage their actual efforts so that they benefit your covert goals. Note that this is doable whether the actual problem is in fact a Cloud of Doom or merely Bullshit Mountain, and that applying this very process to Bullshit Mountain is one of the more common things that turns Bullshit Mountain into a Cloud of Doom in the first place.
If I catch you doing this within any organization that I am aligned with, I consider it within my rights to destroy you.
Dragonslayers are Gryffindor, Shovelers are Hufflepuff
Noticing what kind of problem you have the right temperament to solve is key to avoiding burn-out. Dragonslaying is glamorous, high-praise work; shoveling Bullshit Mountain is thankless and grueling, and the person who finally gets the praise is usually the one who did the least actual work. From my perspective, Project Hufflepuff was in many ways a direct attempt to train up people who could handle the Rationalist-and-EA-community's Bullshit Mountain before it turned into a Cloud of Doom. (Whether it's too late now or not is a matter for the Slytherins to convince you; I will say no more on this today.)
One big problem with shovelers is that they still expect praise and reward for shoveling Bullshit Mountain; no one seems to be telling them that it ain’t gonna happen. (The smart ones figure this out on their own, and either go away or grit their teeth and get to work anyways.) As a system for motivating people to shovel, anything like Project Hufflepuff is doomed to fail from the beginning. What you need is a way to identify the people who are already doing the work, and make sure they stay funded and supported and nurtured. (Whether this problem itself is a Bullshit Mountain or a Cloud of Doom is, again, left as an exercise for the reader. A Dragon it ain’t, or Project Hufflepuff would have worked.)
Conclusions
So, anyway, yeah. Here we are. Do what you will with it. Or don’t; I’m not your dad.
This would have been more readable if you gave concrete examples of each kind of problem. It seems like your claim might be a useful dichotomy, but in its current state it’s likely not going to cause me to analyse problems differently or take different actions.
The difference between Mountains and Clouds seems to be the most critical. They’re both described as “problems with many small causes”, and now I know they need different strategies, but I don’t feel well equipped to notice differences, if any.
To be more specific, after rereading the article and thinking for a few minutes, the skill seems to be in correctly deciding whether to accept “everything is a little slow and painful!” as a single big symptom (Mountain), or seeing it as an excuse to not examine and uncover the many small symptoms contributing to that feeling (Cloud). Probably a good place for some heuristics on what bad diagnoses look like.
That still seems too vague to be useful. I don’t have the slack to do the work of generating good examples myself at the moment.
I don’t know how well the things I’ve been thinking about lately fit into this model, so I’m going to attempt to apply these to my own field (Youth Emergency Services for suicide/homicide/psychosis):
1) Bug is when a clinician gets a call about a crisis and fails to triage whether it’s an issue of sufficient severity. Mom calling because she took her kid’s phone away and now they’re yelling and won’t go to bed can probably be de-escalated over the phone, even if they said the magic words of “I’m going to kill myself,” as long as they don’t have a history of depression or suicidal ideation or self-harm. Simple fix is to train people to better triage and how to respond/deescalate remotely.
2) Dragon is kind of what the job was designed to deal with. Before mobile response teams existed for mental health workers, and in counties that don’t have them, people just call the police. The police will either sit the person down to give them a stern talking to, or arrest them, or take them to the nearest hospital’s psych ward, depending on what’s happening. Someone decided “Hey what if we sent therapists instead, who could de-escalate the problem instead and reserve the involuntary commitment for the cases that really warrant it?” Problem is it’s not really a “solvable” problem, which means...
3) Bullshit Mountain is sort of the job itself? There’s no Win Condition, it’s just an ongoing series of getting crisis calls and helping those involved as best we can, whether that’s teaching coping skills, creating safety plans, connecting them to services, or sometimes initiating a Baker Act (Florida protocol for involuntary 1-3 day stay at psych ward). Some people only call us once and everything’s fine in follow-up calls. Others call multiple times per month, sometimes even per week, and have been Baker Acted by us or police over a dozen times. But stats say the county is overall doing better for youth suicides/homicides and hospital psych wards have fewer kids being brought in, all this even with things like the Parkland shooting, so we keep plugging at it.
4) Cloud of Doom seems like it could be at least a couple things. The first that no one wants to be left holding the Liability Hot Potato for a suicide, let alone the next mass school shooter, so we have police departments and schools calling us if any kid says anything even remotely like “I want to kill my geometry teacher,” even if it was two weeks ago in a text message to their girlfriend, and then getting mad at us when we won’t Baker Act them.
Which leads to some Molochian bullshit, where organizations start to *set protocols* to call us for anything that seems remotely worrisome, and then *our protocols* say that we can’t be the ones to define a crisis for the caller, and so we get the occasional hilariously-frustrating call where a juvenile probation officer or school administrator calls us and says “They did/said X Y and Z, so do you need to make an assessment?” to which we have to reply “Well, if you say you want an assessment we’ll be right over,” to which they reply “Well they did X Y and Z,” to which we will reply “Yes, I understand, does that mean you want us to come assess?” to which they reply “Do you feel they need an assessment?” Either we decide what’s worth a crisis and liability falls on us, or the caller decides and liability falls on them.
You know where this is going: we're getting a steadily increasing number of calls, and our ability to triage them is being slowly but surely hemmed in by expectations from on-high.
The result being a second Cloud of Doom, where the YES team (and I imagine other mobile crisis teams in other counties) have tons of communal responsibility but no actual institutional power. Which is often demoralizing and makes it hard to retain staff. Which means we’re more often in situations where all our clinicians are busy out on calls, some of which we didn’t really need to go on, but more keep coming in and we have to tell them to either take their kids to the hospital themselves or call the police if they believe it’s an immediate crisis. Which was kind of what the job was originally designed to prevent from happening.
Does this seem to fit, or am I missing the essence of some of these?
I'd have guessed that the liability hot potato is not a Cloud of Doom but in fact a Bullshit Mountain, as defined in the text. There's a well-defined problem—avoidance of anything that could create liability leads to institutional paralysis—and a bunch of ways it happens.
Ah, maybe. I was under the impression that CoDs are the emergent properties of multiple Dragons or BSMs interacting, and their main feature seemed to me that they are the thing that “gums up the works” and makes it harder for people in a system that’s trying to solve problems to actually do so.
The liability hot potato itself is a Bullshit Mountain. Once the liability hot potato becomes a cause for multiple symptoms downstream of it, you’re in Cloud of Doom territory. So the ultimate problem is contextual—are you operating at a level of control where you can directly confront the LHP? If so, pick your causes and start shoveling. Or are you at a level of control where the downstream effects of the LHP are themselves the landscape you have to navigate? If so, welcome to your Cloud of Doom.
Why can't you start shoveling those CoDs to pull Dragons off of them? I'm not very familiar with therapy, but revising business processes (a what-to-do-in-which-case instruction) is usually a good way to handle power/responsibility problems. Finding occurrences where people have responsibility for something but no power to change it, and defining how to manage those cases, should help reduce the overall CoD. I'm a bit confused that the article predicts this will only make a CoD worse; I wonder why.
This is an interesting classification.
Questions:
What exactly does “injecting Slack” mean? (Both in theory, and in practice?)
The “Harnessing the Cloud of Doom” section is rather cryptic; could you expand on it?
What are some examples of each kind of problem? (Three examples per category would be ideal. But any at all would be a well-appreciated start!)
I would also like to know the answers to these. I know that “injecting Slack” is a reference to Zvi’s conception of Slack.
1. I see you haven’t been reading other articles very closely. Given that, I don’t expect you to have read this one very closely. Or that you would read a long reply that I might give very closely. Therefore, why should I spend the effort on it, just so we can get into another arc of pedantry? I don’t really have a stake in it, you see.
2. Basically the same answer as 1. If someone else wants to expand, I’m sure they can; I’d appreciate if they did, but not to feed you.
3. Looks like Daystar Eld already started here, I think it’d be neat if other people would provide more.
This sort of seems like an example of 2: https://www.vox.com/policy-and-politics/2017/9/8/16257502/verrit-peter-daou-aweseome (and more generally, all attempts to collect credit for Creating/Maintaining a Canonical Information Source in an environment where there's an open infowar).
(I am having trouble parsing this comment in the context of ialdabaoth’s comments, but think it might be saying something interesting. Is the 2 in this comment referring to ialdabaoth’s comment, or the 2. in Said’s comments, or to a 2 listed somewhere in the article?)
Example of Said’s 2, i.e. example of harnessing cloud of doom. Should probably post a separate comment about it.
Interesting and elegant model!
I’m having trouble parsing what the Cloud of Doom is. It sounds similar to a wicked problem. Wicked problems come with the issue that there’s no clear best solution, which perhaps is true of Clouds of Doom as well. On the other hand, you make two claims about wicked problems:
Every organization doing real work has them
There’s one way to solve them, by adding lots of slack
I’m not sure where those are coming from, or what those imply. Examples or explanations would help.
Another thought: after the creation of vaccines, smallpox was arguably a “bug”. There’s a clear problem (people infected with a specific organism) and a clear solution (vaccinate a bunch of people and then check if it’s gone). It still took a long time and lots of effort. Perhaps I’m drawing the analogy farther than you meant it to imply. (Or perhaps “a bunch of people” is doing the heavy lifting here and in fact counts as many little problems.)
One aside:
I mention in ‘Shovelers are Hufflepuff’ that the credit for solving a Bullshit Mountain doesn’t go to the Hufflepuffs who actually solve it.
What DOES happen is, it goes to the Gryffindors who rush in to slay the biggest Dragon that the shovelers uncover. Since the Dragon-slaying is the biggest salient change, all progress gets attributed to it, including the progress made by the shovelers clearing out Bullshit Mountain in the first place.
If you want to poach Hufflepuff virtue, the best way to do it is to be the kind of Gryffindor that knows how to get along with Hufflepuffs, and then slay all the dragons as they uncover them. You probably won’t even be resented by them for it!
You’ll still be a bit of a dick, though.
“Scientists with notable discoveries” might be an example of Gryffindors.
In this framework, Bullshit Mountain and Cloud of Doom have two distinguishing factors:
1. Bullshit Mountain is multi-cause, single symptom, vs Cloud of Doom being multi-cause, multi-symptom
2. Bullshit Mountain is recoverable, Cloud of Doom is unrecoverable. (Or at least, Cloud of Doom as framed here is stronger evidence that you either need drastic changes or to give up)
Both distinctions seem worth being aware of, but I’m not sure how natural it is to cluster them together. It seems like a sufficiently big Bullshit Mountain could make the situation unrecoverable, or a sufficiently small cluster of problems/symptoms could require “everyone picks up a shovel and digs and probably doesn’t really get proper credit but the organization is still pretty functional.”
I like the basic idea of the classification. I suggest “Hydra” instead of “Dragon”, since you specifically mention multiple seemingly independent heads/symptoms. If I were to only read the comments, I would think a Dragon was just a particularly large or difficult Bug; I don’t know if that means people are letting the definition slip in that direction.
I think I need to chew on this more and think about how much usefully breaks down along these lines. As I read this, you’re describing a correlation between a 2×2 matrix of bimodal levels of multiplicity of causes and effects, and good strategies for dealing with problems with those traits. Is that accurate? But there’s also a very distinct feeling that each of these categories evokes (especially given the names), and I’m not as sure that the feeling is correlated with the purported criteria; I have an intuitive guess that it’s more correlated with perceptions of agency over problems, which may have only a skewed relation to the “number” of causes and effects (insofar as that’s meaningful in the first place).
You’re not the first to suggest s/Dragon/Hydra/g here, and I’d be tempted to agree, if not for the fact that dragon-slaying is significantly more poetic than hydra-slaying. OTOH, “Hydra” serves as a mnemonic that attacking symptoms is a Known Bad Strategy.
(Do note that the existence of a dragon can cause a series of not-obviously-related symptoms—this stuff is on fire, and this stuff is smashed up, and these people got eaten...)
If I’m not the first, was this posted before? I don’t see the same suggestion elsewhere in the comments, at least…
And the part I’m worried about above is that the poetic view will lead to conflationary thinking about the categories along the way, rendering the model a lot less useful; sure, a dragon can cause multiple symptoms, but that’s not the central image that comes to mind (at least to me), and trying to get a grip on something like this as an intuition pump gets fragile if you lean into what sounds compelling.
No, I’m referencing an in-person conversation. (Incidentally, the fact that ialdabaoth fielded that suggestion and still wrote this post with ‘dragon’ makes me worry that they’ve got at least an instinct that it’s the right word in some way I’m missing.)
And I think I see the worry that you’re pointing at here. I think it’s a valid one, though not one that I expect can be resolved entirely through theory; I’d like to see some people work with the ontology for a bit to see which words work in useful ways.
This seems like an example of trying to harness the cloud of doom. (In that case the attempt was transparent enough not to work; examples that actually did work would of course be hard to establish shared beliefs about.) The cloud of doom is the breakdown of shared discourse due to collapse of trust in common definers of canonical reality. It's harnessed by an organization directly trying to claim credit for being canonical with officialness theater like verification numbers, in a way that couldn't plausibly even slightly alleviate the underlying problem.
I'm going to steal this; I'll probably try to use a continuous relaxation of it and break it into causal parts and such.
Yeah, strong endorsement of treating this as eigenvectors rather than category-buckets.
Bullshit Mountain sounds a bit like a situation where there’s a convergent focus for costs to be externalized onto. It’s hard to fix because local incentives are always towards not just ignoring the problem, but actively making it worse. This can on very rare occasions be “fixed” with massive investments of energy (with an opportunity cost that may or may not be worth it). Sometimes, though, an organization should have a finite lifespan, and the correct response to Bullshit Mountain is to manage the decline with harm-reduction.
More generally, it seems to me that this post subtly frames things as though the only organization is the one being focused on. Spinoffs and emigrés can often have perfectly good lives elsewhere.
Overall, I expect that as I reflect on this schema I’ll want to start using it, and I notice myself feeling the lack of a corresponding ontology of system capacity levels.