(Repeating some of “Policy debates should not appear one-sided”, but expanding upon it and then countering “Multiple Factor Explanations Should Not Appear One-Sided”.)
If asked directly, most people who consider themselves sophisticated thinkers would probably agree that “acknowledging trade-offs” regarding your proposed policy is the right thing to do. In other words, they would agree that policy debates should not appear one-sided. But if you only ever hear that side of the message, you might apply it to questions of fact as well, where it is not the correct stance. So, let us first investigate why questions of fact should actually appear one-sided.
The evidence for facts is one-sided
Consider the question of whether Earthly life arose by natural selection. If this were true, then it would have had a profound impact on everything in our world. That is, the answer to the question of whether or not the theory of natural selection is true strongly affects the answers to other questions we can ask about the world. The following graphical model is meant to visualize this. The arrows indicate the direction of causality: if Earthly life arose by natural selection, then this is the cause (or one cause) of our finding fossils of extinct animals. It is not the case that fossils came first and somehow caused natural selection.
(Evo: “Earthly life arose by natural selection”, Fos: “there are fossils from extinct species”, DNA: “the genetic code of all life on Earth is encoded by DNA”, Nic: “species occupy their own peculiar ecological niche”, Emb: “embryos from different species look very similar”, Psy: “human psychology is adapted to the living environment of the Pleistocene”)
Now, suppose you have heard of this theory but you haven’t seen any evidence yet. (A highly unusual situation for a scientist – the hard part is always coming up with the hypothesis, not finding the evidence.) So, you have the theory and the graph it induces, but you don’t yet know whether the graph actually corresponds to reality; you cannot tell with your current state of knowledge. One day, you find fossils. This increases the credence you give to the theory, but it’s not certain yet; there may still be other plausible theories. Next, you observe Darwin’s finches on the Galápagos Islands and it strikes you how well they are adapted to different food sources. Evo goes up in probability again. After many such observations, it becomes almost certain that all future observations will be compatible with Evo as well. How can we make such a bold claim? Shouldn’t we give more credence to the possibility that future evidence will surprise us? In fact, a skeptic of evolution could object: isn’t it suspicious that all discoveries allegedly support the theory of evolution? Such a perfect record can only be the result of fraud!
But that’s not true, because the pieces of evidence aren’t independent. They all have a common cause, Evo, and as such they are highly entangled/correlated. When it comes to correlated pieces of evidence, finding that they point in the same direction is exactly what you should expect. The above diagram is supposed to show this visually: the evidence feeds back into the hypothesis, and once the hypothesis has sufficient support, it feeds forward into the evidence, increasing our expectation of finding compatible evidence in the future.
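For concreteness, here is a minimal numerical sketch of that two-way flow, in Python. The likelihoods are made up for illustration, and I assume each piece of evidence is conditionally independent given Evo:

```python
# Minimal sketch: each compatible observation updates P(Evo) upward,
# and a higher P(Evo) raises the predicted probability of the *next*
# piece of compatible evidence.  All numbers are made up.

p_evo = 0.5          # prior credence in the hypothesis
P_E_GIVEN_EVO = 0.9  # chance of a compatible observation if Evo is true
P_E_GIVEN_NOT = 0.3  # chance of a compatible observation if Evo is false

for obs in ["Fos", "DNA", "Nic", "Emb", "Psy"]:
    # predictive probability of the next compatible observation
    p_next = P_E_GIVEN_EVO * p_evo + P_E_GIVEN_NOT * (1 - p_evo)
    # Bayesian update once the observation turns out compatible
    p_evo = P_E_GIVEN_EVO * p_evo / p_next
    print(f"{obs}: predicted P(compatible) = {p_next:.2f} -> P(Evo) = {p_evo:.2f}")
```

The predicted probability of compatible evidence climbs from 0.60 toward 0.90 as the hypothesis gains support – a long streak of one-sided evidence is exactly what the model tells you to expect.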
Still, in principle, it is perfectly possible to find refuting evidence for evolution. If, for example, it turned out that humans have extrasensory capabilities (ESP) that are useless from an evolutionary perspective (i.e., which don’t increase inclusive fitness), then that would be a clear strike against evolution. No adaptation that complex would have survived the pressure of natural selection without being useful. But – as should be the case – such a discovery seems quite unlikely, given all the evidence that has accumulated. You cannot, at the same time, assign a high probability to the evolution hypothesis being true and give a non-tiny probability to the prospect of finding counterevidence – doing so would violate the conservation of expected evidence.
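To spell out that last step: conservation of expected evidence is just the law of total probability, read as a constraint on updates. Writing $H$ for the hypothesis and $E$ for a prospective observation,

$$P(H) = P(H \mid E)\,P(E) + P(H \mid \neg E)\,P(\neg E),$$

which rearranges to $P(E)\,\bigl(P(H \mid E) - P(H)\bigr) = P(\neg E)\,\bigl(P(H) - P(H \mid \neg E)\bigr)$: the expected upward shift from confirming evidence must exactly balance the expected downward shift from disconfirming evidence. If observing useless ESP would be a large strike against evolution, your probability of observing it must be correspondingly tiny.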
Policy debates should indeed not appear one-sided
Now, let’s consider policy debates. Policy debates may seem completely different from questions of fact, but they can be described by a very similar causal diagram. The big difference is that questions of fact “have already happened” (and so we can observe the effects now), whereas the effects of proposed policies typically lie in the future, where we can’t easily observe them. In fact, the effects might never become observable, because we might decide against adopting the policy. We can circumvent this problem by using Pearl-style counterfactuals – the question then becomes: if we were to adopt this policy, what would the effects be? This allows us to draw the same kind of causal diagram as above for the question of evolution.
As an example, consider the question of whether marijuana should be legalized. (In some places this has obviously moved beyond being just a policy debate and has become policy, but there are still many places where this is not the case.) I tried to summarize the analysis by Scott Alexander (including the update) in the following diagram:
(MJ: “marijuana is legalized”, Arr: “fewer people are arrested/go to prison due to marijuana possession”, Cost: “costs are lowered for police/legal system/prisons”, Rec: “fewer people get a criminal record”, Tax: “government gets additional taxes”, Use: “more people use marijuana”, Eff: “more people experience the positive effects of marijuana”, IQ: “more cases of slightly lowered IQ/higher chance of schizophrenia from marijuana”, Car: “more car accident fatalities from stoned drivers”)
These are not all potential consequences of such a policy, just the most important ones that Scott could think of (and I don’t have anything to add). As we live in a complex society, there are always many, many consequences, and the first problem one faces when analyzing a policy is that it is very hard to think of all the relevant ones. It’s (relatively) easy to evaluate a proposed mechanism that someone tells you about; the hard part is coming up with the mechanism.
What we can see in this example is that a potential proponent of the policy might first discover all the positive effects: fewer people in prison, taxes saved from less prosecution, taxes gained from legal marijuana sales. But does that actually make it more likely that other, not-yet-discovered consequences are positive as well – analogous to how confirming evidence became more likely in the evolution example as our credence in the hypothesis grew? No. There is no structure here that would produce a correlation in the effects’ expected welfare change. The arrows in the diagram I drew stand for probability, not for utility. The policy might have many good effects, but there is just no reason that all of them should be good. In the evolution example, we were judging evidence by how compatible it was with a hypothesis, and there was a two-way flow of increasing probability; that is not what is going on here. Here, we try to think of as many consequences as possible and then judge their impact on overall utility, but there is no two-way flow for utility – discovering a high-utility consequence doesn’t increase the expected utility of other, not-yet-discovered consequences.
So, even if you have observed that democracy has many good consequences, this does not mean that consequences that you will investigate in the future are guaranteed to be good as well.
(This isn’t to say you cannot learn anything about democracy from your past observations; you can still, for example, conclude that when founding a new country, democracy would be a good choice of government, because it has worked well in other countries. Similarly, you can of course still make trade-offs; you can conclude that democracy is overall worth doing. But there is no guarantee that democracy has only good consequences.)
Humans seem to be prone to applying the kind of thinking that only works for factual questions to policy debates as well. The problem seems to be that we like to condense all our opinions about a thing into a simple good–bad scale (which is, I think, what the term affect heuristic refers to). With this kind of thinking, “high probability”, “high benefits”, and “low risk” are all just summarized as “good”. And if a thing is connected to a “good” thing, then it’s probably also “good”. This sure is computationally convenient, but it can lead you far astray.
There is another important lesson in Scott’s analysis. He found that the overall effect of legalized marijuana is most likely dominated by the effect on car accidents. And I suspect – though I don’t have formal evidence for it – that in most such cost-benefit analyses there is one factor (or maybe two) that completely outweighs all the others, and I think this makes sense: it is unlikely that reality is arranged in such a way that you have, say, 8 consequences that are all of about the same size. Just because we categorized the effects neatly into 8 bins does not mean the effect sizes agree with this categorization.
So, that’s why observations supporting a true theory are naturally one-sided, but policy debates are not. Next, we’ll look at observing causes (as opposed to effects) of a true fact, where the traps are more subtle.
Multiple causes get eaten by piranhas
The evolution example is an instance of trying to find a theory (in this case the theory of evolution) to explain observations (like “where do all these animals come from?”). This is modeled by a common cause having many effects. But sometimes scientists want to investigate the opposite direction: given a certain event that we know happened, what were the causes? This is modeled as several causes having a common effect. A common failure mode here is to assert that an event had a large number of independent, roughly equally important causes.
Consider the example from Stefan Schubert’s article, which in turn takes it from Jared Diamond’s Guns, Germs and Steel: the question of why agriculture first arose in the Fertile Crescent (and not elsewhere). Diamond identifies 8 factors that, according to him, determined the outcome. It’s not completely clear whether he considers them to be independent factors, but as I understand it, he does not try to identify common causes among these factors, so he mostly treats them as independent.
The following diagram is a visual representation of this analysis:
(Cres: “agriculture first developed in the Fertile Crescent”, Seed: “it had big-seeded plants, which were abundant”, Obv: “and occurring in large stands whose value was obvious”, Herm: “and which were to a large degree hermaphroditic selfers”, Annu: “it had a higher percentage of annual plants than other Mediterranean climate zones”, Div: “it had a higher diversity of species than other Mediterranean climate zones”, Elev: “it had a higher range of elevations than other Mediterranean climate zones”, Dom: “it had a great number of domesticable big mammals”, nHG: “the hunter-gatherer lifestyle was not that appealing in the Fertile Crescent”)
As you can see, this is in a way the opposite relation to the previous diagrams: one effect with multiple causes instead of one cause with multiple effects. But how plausible is it that an event in history really had eight independent causes? You might have the feeling that there is something wrong or suspicious about this model, but what exactly is it?
Now, in contrast to the policy outcomes discussed above, there actually is some expected correlation/entanglement between independent causes once we have observed their common consequence – which, in this case, we have. However, you might still find it suspicious that all the factors happened to be supporting factors (i.e., factors that make the observed outcome more likely). We can get out of this criticism on a technicality: no one claimed that this list of factors is exhaustive!
Consider this: you could imagine the algorithm for generating these factors as having done the following: 1) take a true fact; 2) make a list of other, vaguely related facts that are true and causally upstream (i.e., earlier in time) of the true fact from step 1; 3) pick only those items from the list that sound like they support (or at least don’t oppose) the fact from step 1. You are then left with a very one-sided list of supporting factors. Of course, this is not a complete picture and is quite misleading, but such an analysis is also not false in the sense that there somehow couldn’t be that many supporting factors.
However, there is still a deep problem with the Fertile Crescent model, which I will show now. (There will be a bit of math.)
To make the discussion of the factors behind a historical event be about anything real at all, I believe we have to turn to counterfactuals again. We can determine the strength of a factor by asking: if we reached in – from outside the world – and manually modified the factor (for example, removing most of the domesticable animals (Dom), like cows and sheep), how much would this affect the outcome (for example, would agriculture not develop there)? Another way to ask that question: if you were shown an alternate Earth which looked exactly the same as ours except for those eight factors, which are initially kept hidden from you, and then you learn that there are, in fact, lots of domesticable animals in the Fertile Crescent, how much does this affect your estimate of the probability that agriculture will first arise in the Fertile Crescent in this alternate Earth as well? That is, how much information did you gain from observing one of the factors? In other words again: what is the mutual information between the factor and the outcome?
(Feel free to skip the next paragraph if you already know what mutual information is.)
Mutual information “quantifies the amount of information [for example in bits] obtained about one random variable by observing the other random variable” (source: Wikipedia). But in case that wasn’t very clear, here is some intuition for why mutual information is the right quantity to use here: Say we have two random variables $X$ and $Y$ which can take the values 0 or 1, and let’s say the prior for $Y$ is $P(Y{=}1) = 0.5$. But, let’s say that, when we condition on $X{=}1$, we have $P(Y{=}1 \mid X{=}1) = 0.9$. (Conversely, we must then have $P(Y{=}0 \mid X{=}1) = 0.1$ and, if $X$ is also uniform, $P(Y{=}1 \mid X{=}0) = 0.1$.) Clearly, $X$ is able to tell us something about $Y$. To quantify how much observing $X{=}x$ tells us about what the value of $Y$ will be, we can take the quotient $\frac{P(Y{=}y \mid X{=}x)}{P(Y{=}y)}$, and we can take the logarithm of this to get some nice mathematical properties: $\operatorname{pmi}(x; y) = \log \frac{P(y \mid x)}{P(y)} = \log \frac{P(x, y)}{P(x)\,P(y)}$. (“pmi” stands for “pointwise mutual information”.) $\operatorname{pmi}(x; y)$ is 0 exactly when $X{=}x$ does not convey any information about $Y$. It is maximal when $x$ fully determines $y$. Mutual information $I(X; Y)$ is then simply the expected value of $\operatorname{pmi}(x; y)$, taken over the joint distribution of $X$ and $Y$.
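Here is the same toy example as executable Python, additionally assuming $P(X{=}1) = 0.5$ so that the joint distribution is fully pinned down:

```python
import math

# Joint distribution matching the example above: P(Y=1) = 0.5,
# P(Y=1 | X=1) = 0.9, P(Y=1 | X=0) = 0.1, with P(X=1) = 0.5 (assumed).
p = {(0, 0): 0.45, (0, 1): 0.05,
     (1, 0): 0.05, (1, 1): 0.45}   # keys are (x, y)

p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.5, 1: 0.5}

def pmi(x, y):
    """Pointwise mutual information of the event (X=x, Y=y), in bits."""
    return math.log2(p[(x, y)] / (p_x[x] * p_y[y]))

# Mutual information = expected pmi over the joint distribution.
mi = sum(pxy * pmi(x, y) for (x, y), pxy in p.items())
print(f"pmi(1, 1) = {pmi(1, 1):.3f} bits, I(X;Y) = {mi:.3f} bits")
```

This prints roughly 0.85 bits of pointwise mutual information for the matching outcomes, and about 0.53 bits of mutual information overall – observing $X$ tells you quite a lot, but far from everything, about $Y$.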
Now, with this concept of mutual information in hand, we can make very precise statements about the number and strengths of potential factors that influence an outcome. Given mutually independent random variables $X_1, \dots, X_n$ (i.e., the “causes”) and another random variable $Y$ (i.e., the “outcome”), the sum of all the mutual informations between the $X_i$ and $Y$ is bounded by the entropy of $Y$:

$$\sum_{i=1}^{n} I(X_i; Y) \le H(Y).$$

The (information) entropy $H(Y)$ of a random variable tells us how many bits we need on average to describe the outcome. It is maximal for a uniform distribution. For example, for a binary uniform distribution, the entropy is 1 bit.
The given inequality allows us to conclude that if there are 8 factors influencing an outcome, then at least some of them have to be pretty weak (or, alternatively, the factors are not mutually independent). Wikipedia tells me there were 7 places in the world where agriculture was invented at some point: the Fertile Crescent, the Yangtze and Yellow River basins, the Papua New Guinea Highlands, Central Mexico, Northern South America, sub-Saharan Africa, and eastern North America. Let’s round this up to 8, and pose the question: out of 8 equally likely population centers, which one will develop agriculture first? To answer this question, we need 3 bits of information. Spread equally over 8 factors, this means 0.375 bits of mutual information from each factor, which isn’t much.
(To be clear, we are still assuming here that the 8 factors are all completely independent which is most likely not the case; for example, the higher diversity of plants is plausibly related to the higher diversity of animals and thus the great number of domesticable animals. For dependent factors, the bits of mutual information may add up to more than the outcome entropy of 3 bits.)
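As a quick numerical check of the bound, here is a toy model of my own choosing: 8 independent fair coins as “causes”, with the “outcome” being whether a majority of them come up 1.

```python
import math
from itertools import product

def entropy(dist):
    """Shannon entropy, in bits, of a {value: probability} dict."""
    return -sum(q * math.log2(q) for q in dist.values() if q > 0)

n = 8
majority = lambda xs: int(sum(xs) > n // 2)  # outcome: majority of the coins

# Enumerate all 2^8 equally likely worlds to get exact distributions.
p_y, joint = {}, {}
for xs in product((0, 1), repeat=n):
    w = 0.5 ** n
    y = majority(xs)
    p_y[y] = p_y.get(y, 0) + w
    for i, x in enumerate(xs):
        joint[(i, x, y)] = joint.get((i, x, y), 0) + w

def mi(i):
    """Exact mutual information I(X_i; Y), in bits; here P(X_i = x) = 0.5."""
    return sum(joint[(i, x, y)] * math.log2(joint[(i, x, y)] / (0.5 * p_y[y]))
               for x in (0, 1) for y in (0, 1) if (i, x, y) in joint)

total = sum(mi(i) for i in range(n))
print(f"sum of I(X_i;Y) = {total:.3f} bits <= H(Y) = {entropy(p_y):.3f} bits")
```

Even though all 8 coins genuinely influence the outcome, their mutual informations sum to well under one bit here – each individual “cause” is necessarily weak.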
The bound given above is a special case of the more general “Piranha Theorem for mutual information” (stated here by Tosh et al.). There is a similar theorem for correlation, which might be easier to visualize. Let $X_1, \dots, X_n$ be mutually independent real-valued random variables (with finite non-zero variance), and let $Y$ be another (real-valued) random variable; then the sum of the magnitudes of the correlations is bounded in the following way:

$$\sum_{i=1}^{n} \left| \rho(X_i, Y) \right| \le \sqrt{n}.$$

With 8 independent factors with equal influence on the outcome, this implies a correlation of $1/\sqrt{8} \approx 0.35$ with the outcome, per factor. (Correlation of 0.35 looks like this or like this.)
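The correlation version can be checked the same way; here the outcome is simply the sum of 8 independent standard normals (my construction, chosen so that all factors have equal influence):

```python
import numpy as np

rng = np.random.default_rng(0)
n, samples = 8, 100_000
X = rng.standard_normal((samples, n))  # 8 mutually independent factors
Y = X.sum(axis=1)                      # outcome influenced equally by all 8

corrs = [abs(np.corrcoef(X[:, i], Y)[0, 1]) for i in range(n)]
print([round(c, 3) for c in corrs])          # each ~ 1/sqrt(8) ~ 0.354
print(round(sum(c ** 2 for c in corrs), 3))  # squared correlations sum to ~ 1
```

Each factor comes out at roughly 0.35, saturating the bound for the equal-influence case.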
All this is to say that there are strict limits to how many strong explanatory variables an event can have. And this should affect how we evaluate claims about events being explained by eight factors – we should conclude that those can only be pretty weak factors. And it sounds much less impressive to say “I have found 8 things that are weakly correlated with the fact that agriculture arose in the Fertile Crescent.”
Why are these bounds not an issue for the factual question from the beginning? Why didn’t we say there that the number of pieces of evidence that are strongly entangled with the hypothesis must be limited to at most 3 or so? After all, the Piranha Theorems care only about the mutual information, not about the direction of causality.
The difference is that a causal structure as in the first figure does not make any claims about the consequences being independent of each other. Quite the opposite in fact. The diagram clearly shows that the pieces of evidence have a common cause, which is the textbook example of a dependency. And with mutually dependent variables, the Piranha Theorems have much less bite.
On the other hand, the causal structure in the third figure is an implicit claim that the 8 causes are independent of each other. The technical reason for this is that the causes are connected by a “collider”, which blocks the path by the rules of d-separation. The intuitive reason is that the outcome of an event shouldn’t be affected by its consequences, and thus causes don’t become entangled just because they happen to affect the same event. So, the Piranha Theorems apply here in full strength.
(This is of course predicated on the causal diagram’s being the correct one, which is actually what we’re trying to determine. As mentioned before, there might, in reality, be other connections between the variables that don’t go through the common consequence/the collider.)
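A small simulation makes the collider behavior concrete. I use binary causes and take the outcome to be their logical OR (a toy choice): marginally the causes are uncorrelated, but once we condition on the outcome – as we implicitly do when explaining an event that we know happened – they become entangled.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
a = rng.random(N) < 0.5   # cause 1
b = rng.random(N) < 0.5   # cause 2, independent of cause 1
y = a | b                 # common effect: the collider

print(round(np.corrcoef(a, b)[0, 1], 3))        # ~ 0.0: marginally independent
print(round(np.corrcoef(a[y], b[y])[0, 1], 3))  # ~ -0.5: "explaining away"
```

Conditional on the outcome having happened, learning that one cause was present makes the other less likely – exactly the correlation/entanglement between causes mentioned earlier.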
But apart from the necessarily limited strengths of the factors, there is another thing here that seems unrealistic. We now come back to the issue discussed at the end of the previous section: it is unlikely that all the connections in the causal diagram have the same strength. The analysis of legalized marijuana was, in the end, dominated by a single consequence. And it seems very likely to me that there is an analogous effect in the analysis of causes: in reality, there are likely one or two outsized factors, and the rest are small to negligible.
Why would reality be that way? Well, to me the question is more like: why would reality not be that way? Eight equally strong factors is the weird edge case! Why would reality be so neatly organized as to divide the cause into 8 equally sized packages? It’s more of an anti-prediction that factors will drastically vary in effect size. The space of neatly balanced configurations is much smaller than that of unbalanced configurations. Thinking about it in terms of concrete numbers helps, I think, to make clear to our brain how unlikely this is: would you believe me if I told you that the mutual information of each of Diamond’s factors with the outcome is (in bits): 0.41, 0.40, 0.42, 0.39, 0.39, 0.40, 0.39, 0.40?
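To put a rough number on how small that balanced region is, here is a crude illustration. I assume, purely for the sake of the exercise, that the 3 bits of outcome entropy are split among the 8 factors uniformly at random (a flat Dirichlet distribution – my modeling choice, not anything from Diamond):

```python
import numpy as np

rng = np.random.default_rng(0)
# Split 3 bits of outcome entropy uniformly at random among 8 factors.
splits = 3 * rng.dirichlet(np.ones(8), size=100_000)
# How often do all 8 shares land within 25% of each other?
balanced = (splits.max(axis=1) <= 1.25 * splits.min(axis=1)).mean()
print(f"fraction of neatly balanced splits: {balanced:.4%}")
```

In 100,000 random splits, essentially none are that balanced – configurations like the list of numbers above occupy a vanishing sliver of the space.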
As a side note, one way that an a priori unlikely configuration can arise is if it was subjected to optimization in some form. But with that not being the case here, there is no reason to privilege the hypothesis that all factors have equal strength.
In summary, one or two of Diamond’s 8 factors might be almost completely responsible for the observed outcome, with the other factors being close to negligible and maybe even actually opposing. I am not aware of any rule of probability theory or theory of causality which states that reality has to be like this, but it seems we live in a simple universe where it is rare for something to have many equally strong causes.
(I cannot give a fixed upper limit for how many (roughly) equal-strength causes something can have. I think two (roughly) equal factors are definitely something one can encounter. There are also actual cases of 3 such factors, as in this analysis by Vitalik Buterin, who traced a 9% increase in an outcome variable back to 3 independent (unintended) 3% increases that happened during a single update to the Ethereum blockchain. But this case already feels very extraordinary to me, and if he didn’t have the numbers to back up his claim, I am not sure I would believe it.)
Sufficient causes allow you to cheat at prediction
Finally, as you might or might not have realized already, there is a big caveat to the preceding analysis: it only applies to models with a non-trivial number of necessary causes. A model with a large number of (individually) sufficient causes is fine.
Now, what does that look like? A “sufficient cause” here is something that on its own is already sufficient to produce the outcome in question. We can think of these sufficient causes as multiple paths leading to the same destination – such that, overall, getting to that destination is quite likely. Consider – as a perhaps extreme example – someone wanting to build a new nuclear power plant right in the San Francisco Bay Area (bonus points if it’s sponsored by a billionaire). And let’s say $A$ is the random variable corresponding to the prediction that this construction will be delayed by at least 5 years. As America is a vetocracy, I have no trouble thinking of many potential causes for such a delay. And each of them is probably sufficient on its own to cause the full delay. As such, $P(A \mid C_i)$ is close to 1 for all $i$ (where $C_i$ is the $i$-th potential cause). On top of that, the causes themselves are each quite likely, such that in combination, the final outcome is almost certain. It is overdetermined, as they say. In general, if we have $n$ independent paths that each have a prior probability $p$ of happening, then the chance that at least one ends up happening is $1 - (1 - p)^n$. For example, with $n = 5$ and $p = 0.7$, we get 99.76%. And if the conditional probabilities of the outcome given each path are close to 1, then something like that number is also the probability of the outcome itself.
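The arithmetic, as a one-liner (using the example values from above; the point only needs the numbers to say “many paths, each fairly likely”):

```python
# Chance that at least one of n independent sufficient causes fires,
# each with prior probability p.  n = 5 and p = 0.7 are the example
# values used in the text.
n, p = 5, 0.7
print(f"P(at least one path) = {1 - (1 - p) ** n:.2%}")  # -> 99.76%
```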
Now, in reality, only one of the paths will play out. (The fact that I don’t know exactly which one it will be is a fact about me as an imperfect predictor.) So we haven’t violated the heuristic that in actuality, events only have a quite small number of causes. The true path will get all the mutual information with the outcome, and the others will get none, in accordance with the Piranha Theorems.
As another example, imagine a universe, where general intelligence is for some reason impossible, but which otherwise looks basically like our universe. Life on Earth develops as it did on our Earth but whenever evolution produces a human with general intelligence, it mysteriously melts. In such a world, I could make the prediction that Earthly life will eventually go extinct – I don’t know exactly how it will happen, but given all the dangers in the universe (gamma ray bursts, large comets hitting Earth, the sun blowing up, Andromeda colliding with the Milky Way) it seems all but guaranteed. And it’s not suspicious that I’m so certain about this because the conclusion just follows directly from the fact that it’s an uncaring universe that is allowed to throw dangers at you that you cannot survive.
(This fact is, I guess, the actual ultimate cause and the planet-killing comet is just the proximate cause – same with the vetocracy being the ultimate cause in the other example. It depends on how you factor reality into variables and what you are concentrating on.)
A model with many sufficient causes makes, I think, most sense as a model for the future, where you are uncertain about the true path. If you can resist hindsight bias, you can also try it with past events. (E.g., was World War I overdetermined by 1914? Matthew Yglesias seems to think maybe not.) On the other hand, however, using a model with multiple necessary causes is especially tempting when looking at the past, because you can see all the preceding events. And then you can cleverly argue why they were all necessary.
Excellent post!
It should be noted that one counterexample to this is when dealing with agency. People (or evolution or …) select variables to intervene on that coordinate to cause some outcome of interest. As a result, many logically independent variables can become correlated in such a way as to massively increase the likelihood of some outcome.
Of course, in such a case, one can model it by treating people and their goals as a common cause of all of these different mediators. But it is worth keeping in mind that this is a potential exception to the Piranha principle.
Hear, hear! A most excellent post, sir!
I especially liked the numerical computation for the Piranha theorem for correlation with 8 independent causes. A great intuition pump – a correlation of 0.35 is almost nothing.
In many cases we should probably expect causes to be exponentially distributed: 1 or 2 causes account for the majority of variance, with a steep dropoff.
On the other side of the coin: if something is influenced by many approximately equal independent causes, we again have something easy: the CLT gives us Gaussians. This describes many systems.
More generally, thermodynamics/statistical physics techniques become relevant in these domains.
A well-known but still major problem with multiple-cause explanations is how easy it becomes to blabla away any discrepancy – too much wiggle room. Of course, this is primarily an epistemic problem.
A much less well-known problem is that systems with many causal pathways quickly become chaotic.
There is work from Stuart Kauffman that proves this. I don’t remember the details now, but IIRC it was something like: if you model gene–protein pathways as a bunch of nodes connected by causal wires, then if the number of causal wires per node is on average higher than 2, the whole system will be chaotic.
Chaotic systems certainly exist – e.g., the weather, gene networks – but they’re inherently hard/impossible to predict. Any simple stories are therefore necessarily bs.
EDIT: Another relevant link is gwern’s post detailing how, in general causal graphs, the number of correlations overwhelms the number of genuine causal relations.