Confound it! Correlation is (usually) not causation! But why not?
It is widely understood that statistical correlation between two variables ≠ causation. But despite this admonition, people are routinely overconfident in claiming correlations to support particular causal interpretations and are surprised by the results of randomized experiments, suggesting that they are biased & systematically underestimating the prevalence of confounds/common-causation. I speculate that in realistic causal networks or DAGs, the number of possible correlations grows faster than the number of possible causal relationships. So confounds really are that common, and since people do not think in DAGs, the imbalance also explains overconfidence.
Full article: http://www.gwern.net/Causality
Hi, I will put responses to your comment in the original thread here. I will do them slightly out of order.
A Bayesian network is a statistical model. A statistical model is a set of joint distributions (under some restrictions). A Bayesian network model of a DAG G with vertices X1,...,Xk = X is a set of joint distributions that Markov factorize according to this DAG. This set will include distributions of the form p(x1,...,xk) = p(x1) … p(xk) which (trivially!) factorize with respect to any DAG including G, but which also have additional independences between any Xi and Xj even if G has an edge between Xi and Xj. When we are talking about trying to learn a graph from a particular dataset, we are talking about a particular joint distribution in the set (in the model). If we happen to observe a dependence between Xi and Xj in the data then of course the corresponding edge will be “real”—in the particular distribution that generated the data. I am just saying the DAG corresponds to a set rather than any specific distribution for any particular dataset, and makes no universally quantified statements over the set about dependence, only about independence. Same comment applies to causal models—but we aren’t talking about just an observed joint anymore. The dichotomy between a “causal structure” and a causal model (a set of causal structures) still applies. A causal model only makes universally quantified statements about independences in “causal structures” in its set.
I will try to clarify this (assuming you are ok w/ interventions). Your question is “why is correlation usually not causation?”
One way you proposed to think about it is combinatorial for all pairwise relationships—if we look at all possible DAGs of n vertices, then you conjectured that the number of “pairwise causal relationships” is much smaller than the number of “pairwise associative relationships.” I think your conjecture is basically correct, and can be reduced to counting certain types of paths in DAGs. Specifically, pairwise causal relationships just correspond to directed paths, and pairwise associative relationships (assuming we aren’t conditioning on anything) correspond to marginally d-connected paths, which is a much larger set—so there are many more of them. However, I have not worked out the exact combinatorics, in part because even counting DAGs isn’t easy.
Another way to look at it, which is what Sander did in his essay, is to see how often we can reduce causal relationships to associative relationships. What I mean by that is that if we are interested in a particular pairwise causal relationship, say whether X affects Y, which we can study by looking at p(y | do(x)), then as we know in general we will not be able to say anything by looking at p(y | x). This is because in general p(y | do(x)) is not equal to p(y | x). But in some DAGs it is! And in other DAGs p(y | do(x)) is not equal to p(y | x), but is equal to some other function of observed data. If we can express p(y | do(x)) as a function of observed data this is very nice because we don’t need to run a randomized trial to obtain p(y | do(x)), we can just do an observational study. When people “adjust for confounders” what they are trying to do is express p(y | do(x)) as a function \sum_c p(y | x,c) p(c) of the observed data, for some set C.
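As a concrete sketch of that adjustment formula (not anyone's actual analysis: the little graph C → X, C → Y, X → Y and every number in it are made up purely for illustration), here is what the adjustment looks like in a simulation. The naive conditional p(y | x) is confounded by C, while the adjusted estimate matches what actually intervening on X gives.

```python
# Minimal sketch of adjusting for a confounder:
#   p(y | do(x)) = sum_c p(y | x, c) p(c)
# Graph and parameters are illustrative assumptions, not from the discussion.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def simulate(do_x=None):
    c = rng.random(n) < 0.5                        # confounder C
    x = rng.random(n) < np.where(c, 0.8, 0.2)      # X depends on C ...
    if do_x is not None:
        x = np.full(n, do_x)                       # ... unless we intervene on X
    y = rng.random(n) < 0.1 + 0.3 * x + 0.4 * c    # Y depends on both X and C
    return c, x, y

c, x, y = simulate()                               # observational data

naive = y[x].mean()                                # p(y=1 | x=1): confounded
adjusted = sum(y[x & (c == v)].mean() * (c == v).mean() for v in (0, 1))
interventional = simulate(do_x=True)[2].mean()     # p(y=1 | do(x=1)) by actually intervening

print(f"p(y=1 | x=1)      = {naive:.3f}")          # ~0.72
print(f"adjusted estimate = {adjusted:.3f}")       # ~0.60
print(f"p(y=1 | do(x=1))  = {interventional:.3f}") # ~0.60
```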
So the question is, how often can we reduce p(y | do(x)) to some function of observed data (a weaker notion of “causation might be some sort of association if we massage the data enough”). It turns out, not surprisingly, that if we pick certain causal DAGs G containing X and Y (possibly with hidden variables), there will not be any function of the observed data equal to p(y | do(x)). What that means is that there exist two causal structures consistent with G which disagree on p(y | do(x)) but agree on the observed joint density. So the mapping from causal structures (which tell you what causal relationships there are) to joint distributions (which tell you what associative relationships there are) is many to one in general.
It will thus generally (but not always given some assumptions) be the case that a causal model will contain causal structures which disagree about p(y | do(x)) of interest, but agree on the joint distribution. So there is just not enough information in the joint distribution to get causality. To get around this, we need assumptions on our causal model to prevent this. What Sander is saying is that the assumptions we need to equate p(y | do(x)) with some function of the observed data are generally quite unrealistic in practice.
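To make the many-to-one mapping concrete, here is a small numerical sketch (all parameter values are arbitrary illustrative choices): two causal structures over X, Y with a hidden variable that agree exactly on the observed joint p(x,y) but disagree on p(y | do(x)). The second structure simply takes the hidden variable to be the pair (X, Y) itself, with no X → Y effect at all.

```python
# Two causal structures, same observed joint p(x,y), different p(y | do(x)).
# All numbers are illustrative assumptions.
import itertools
import numpy as np

# Structure 1: H -> X, H -> Y, X -> Y, with hidden binary H.
p_h = np.array([0.5, 0.5])
p_x_given_h = np.array([[0.7, 0.3],            # rows: h, cols: x
                        [0.2, 0.8]])
p_y_given_xh = np.zeros((2, 2, 2))             # indices: [h, x, y]
for h, x in itertools.product(range(2), range(2)):
    p_y1 = 0.1 + 0.3 * x + 0.4 * h
    p_y_given_xh[h, x] = [1 - p_y1, p_y1]

# Observed joint p(x, y): marginalize out the hidden H.
p_xy = np.einsum('h,hx,hxy->xy', p_h, p_x_given_h, p_y_given_xh)

# Structure 1: p(y=1 | do(x)) = sum_h p(h) p(y=1 | x, h).
do_1 = np.einsum('h,hxy->xy', p_h, p_y_given_xh)[:, 1]

# Structure 2: the hidden variable is the pair (X, Y) itself and X has no
# effect on Y; it reproduces p(x, y) exactly, but p(y=1 | do(x)) is just the
# marginal p(y=1) for both values of x.
do_2 = np.full(2, p_xy[:, 1].sum())

print("observed p(x,y):\n", np.round(p_xy, 3))
print("structure 1:  p(y=1|do(x=0)), p(y=1|do(x=1)) =", np.round(do_1, 3))
print("structure 2:  p(y=1|do(x=0)), p(y=1|do(x=1)) =", np.round(do_2, 3))
```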
Another interesting combinatorial question here is: if we pick a pair X,Y, and then pick a DAG (w/ hidden variables potentially) at random, how likely is p(y | do(x)) to be some function of the observed joint (that is, there is “some sense” in which causation is a type of association). Given a particular such DAG and X,Y I have a poly-time algorithm that will answer YES/NO, which may prove helpful.
I understand what you are saying, but I don’t like your specific proposal because it is conflating two separate issues—a combinatorial issue (if we had infinite data, we would still have many more associative than causal relationships) and a statistical issue (at finite samples it might be hard to detect independences). I think we can do an empirical investigation of asymptotic behavior by just path counting, and avoid statistical issues (and issues involving “unfaithful” or “nearly unfaithful” (faithful but hard to tell at finite samples) distributions).
Nerd sniping question:
What is “\sum_{G a DAG w/ n vertices} \sum_{r is a directed path in G} 1” as a function of n?
What is “\sum_{G a DAG w/ n vertices} \sum_{r is a marginally d-connected path in G} 1” as a function of n?
A path is marginally d-connected if it does not contain a collider, that is, a subpath of the form * → * ← *.
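For very small n both sums can be brute-forced. Here is a rough sketch (the enumeration strategy and helper names are mine, and “path” is counted once as a simple path regardless of orientation), which already shows the marginally d-connected count pulling ahead of the directed count:

```python
# Brute-force the two sums above for tiny n: enumerate all labeled DAGs,
# then count directed paths and marginally d-connected (collider-free) paths.
# Everything here is just an illustration, not a worked-out answer.
from itertools import product

def is_acyclic(edges, n):
    """Kahn-style check: repeatedly peel off vertices with no incoming edges."""
    remaining = set(range(n))
    active = set(edges)
    while remaining:
        sources = {v for v in remaining if not any(e[1] == v for e in active)}
        if not sources:
            return False
        remaining -= sources
        active = {e for e in active if e[0] in remaining and e[1] in remaining}
    return True

def all_dags(n):
    """Yield every labeled DAG on vertices 0..n-1 as a set of directed edges."""
    candidates = [(i, j) for i in range(n) for j in range(n) if i != j]
    for mask in product([0, 1], repeat=len(candidates)):
        edges = {e for e, bit in zip(candidates, mask) if bit}
        if is_acyclic(edges, n):
            yield edges

def simple_paths(edges, n):
    """Yield all simple paths (vertex tuples, >= 1 edge) in the skeleton."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    def extend(path):
        yield tuple(path)
        for nxt in adj[path[-1]] - set(path):
            yield from extend(path + [nxt])
    for start in range(n):
        for nbr in adj[start]:
            yield from extend([start, nbr])

def is_directed(path, edges):
    return all((path[i], path[i + 1]) in edges for i in range(len(path) - 1))

def no_collider(path, edges):
    """Marginally d-connected: no  * -> v <- *  anywhere along the path."""
    return not any((path[i - 1], path[i]) in edges and (path[i + 1], path[i]) in edges
                   for i in range(1, len(path) - 1))

for n in range(2, 5):
    directed = d_connected = 0
    for edges in all_dags(n):
        seen = set()
        for p in simple_paths(edges, n):
            key = min(p, tuple(reversed(p)))   # count each undirected path once
            if key in seen:
                continue
            seen.add(key)
            if is_directed(p, edges) or is_directed(tuple(reversed(p)), edges):
                directed += 1
            if no_collider(p, edges):
                d_connected += 1
    print(f"n={n}: directed paths {directed}, marginally d-connected paths {d_connected}")
```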
Edit: I realized this might be confusing, so I will clarify something. I mentioned above that within a given causal model (a set of causal structures) the mapping from causal structures (elements of a “causal model” set) to joint distributions (elements of a “statistical model consistent with a causal model” set) is in general many to one. That is, if our causal model is of a DAG A → B ← H → A (H not observed), then there exist two causal structures in this model that disagree on p(b | do(a)), but agree on p(a,b) (observed marginal density).
In addition, the mapping from causal models (sets) to statistical models (sets) consistent with a given causal model is also many to one. That is, the following two causal models A → B → C and A ← B ← C both map onto a statistical model which asserts that A is independent of C given B. This issue is different from what I was talking about. In both causal models above, we can obtain p(y | do(x)) for any Y,X from { A, B, C } as function of observed data. For example p(c | do(a)) = p(c | a) in A → B → C, and p(c | do(a)) = p(c) in A ← B ← C. So in some sense the mapping from causal structures to joint distributions is one to one in DAGs with all nodes observed. We just don’t know which mapping to apply if we just look at a joint distribution, because we can’t tell different causal models apart. That is, these two distinct causal models are observationally indistinguishable given the data (both imply the same statistical model with the same independence). To tell these models apart we need to perform experiments, e.g. in a gene network try to knock out A, and see if C changes.
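Here is a minimal linear-Gaussian sketch of that last point (the structural equations and the 0.8 coefficients are arbitrary): the two chains produce the same observational dependence between A and C, but intervening on A moves C only in the A → B → C version.

```python
# Two Markov-equivalent chains: observationally alike, interventionally different.
# Coefficients and noise terms are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def chain_forward(do_a=None):           # A -> B -> C
    a = rng.normal(size=n) if do_a is None else np.full(n, do_a)
    b = 0.8 * a + rng.normal(size=n)
    c = 0.8 * b + rng.normal(size=n)
    return a, b, c

def chain_backward(do_a=None):          # A <- B <- C
    c = rng.normal(size=n)
    b = 0.8 * c + rng.normal(size=n)
    a = 0.8 * b + rng.normal(size=n)
    if do_a is not None:
        a = np.full(n, do_a)            # intervening on A cuts it off from B
    return a, b, c

for name, chain in [("A -> B -> C", chain_forward), ("A <- B <- C", chain_backward)]:
    obs_a, _, obs_c = chain()
    _, _, c_do2 = chain(do_a=2.0)
    print(f"{name}:  corr(A, C) = {np.corrcoef(obs_a, obs_c)[0, 1]:+.2f},  "
          f"E[C | do(A=2)] = {c_do2.mean():+.2f}")
```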
Naively, I would expect it to be closer to 600^600 (as a rough lower bound on the number of possible directed graphs with 600 nodes).
And in fact, it is some complicated thing that seems to scale much more like n^n than like 2^n: http://en.wikipedia.org/wiki/Directed_acyclic_graph#Combinatorial_enumeration
There’s an asymptotic approximation in the OEIS: a(n) ~ n!2^(n(n-1)/2)/(M*p^n), with M and p constants. So log(a(n)) = O(n^2), as opposed to log(2^n) = O(n), log(n!) = O(n log(n)), log(n^n) = O(n log(n)).
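For exact values rather than asymptotics, the labeled-DAG counts (OEIS A003024) can be computed from Robinson's recurrence. A quick sketch:

```python
# Exact counts of labeled DAGs via Robinson's recurrence:
#   a(n) = sum_{k=1..n} (-1)^(k+1) * C(n,k) * 2^(k*(n-k)) * a(n-k),  a(0) = 1
from math import comb

def count_labeled_dags(n_max):
    a = [1]                                  # a(0) = 1: the empty DAG
    for n in range(1, n_max + 1):
        a.append(sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * a[n - k]
                     for k in range(1, n + 1)))
    return a

print(count_labeled_dags(6))
# [1, 1, 3, 25, 543, 29281, 3781503]
```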
It appears I’ve accidentally nerdsniped everyone! I was just trying to give an idea that it was really, really big. (I had done some googling for the exact answer, but the formulas all seemed rather complicated, and rather than try for an exact answer and get it wrong, I decided to just give a lower bound.)
There are three possibilities for the edge between any pair of vertices: no edge, or an arrow in either direction (both at once would be a 2-cycle). Since a graph of n vertices has n choose 2 pairs, this gives at most 3^(n choose 2) such graphs, and hence an upper bound of 3^(n choose 2) on the number of DAGs of n vertices (the count still includes graphs with longer cycles). This is much smaller than n^n.
edit: the last sentence is wrong.
Gwern, thanks for writing more, I will have more to say later.
It is much larger: 3^(n choose 2) = ((√3)^(n-1))^n, and (√3)^(n-1) is much larger than n.
3^(10 choose 2) is about 10^21.
Since the nodes of these graphs are all distinguishable, there is no need to factor out by graph isomorphism, so 3^(n choose 2) is the exact number.
The precise asymptotic is a(n) ~ λ n! 2^(n choose 2) ω^(-n), as shown on page 4 of this article. Here lambda and omega are constants between 1 and 2.

That (3^(n choose 2)) is the number of all such directed graphs, some of which certainly have cycles.
So it is. 3^(n choose 2) >> n^n stands though.
A lower bound for the number of DAGs can be found by observing that if we drop the directedness of the edges, there are 2^(n choose 2) undirected graphs on a set of n distinguishable vertices, and each of these corresponds to at least 1 DAG. Therefore there are at least that many DAGs, and 2^(n choose 2) is also much larger than n^n.
Yup you are right, re: what is larger.
So, um … how do we assess the likelihood of causation, assuming we can’t conduct an impromptu experiment on the spot?
The keywords are ‘causal discovery,’ ‘structure learning.’ There is a large literature.
The main way to correct for this bias toward seeing causation where there is only correlation follows from this introspection: be more imaginative about how it could happen (other than by direct causation).
[The causation bias (does it have a name?) seems to express the availability bias. So, the corrective is to increase the availability of the other possibilities.]
Maybe. I tend to doubt that eliciting a lot of alternate scenarios would eliminate the bias.
We might call it ‘hyperactive agent detection’, borrowing a page from the etiology of religious belief: https://en.wikipedia.org/wiki/Agent_detection Now that I think about it, the two might stem from the same underlying belief: that things must have clear underlying causes. In one context it gives rise to belief in gods; in another, to interpreting statistical findings like correlation as causation.
Hmm, a very interesting idea.
Related to the human tendency to find patterns in everything, maybe?
Yes. Even more generally… might be an over-application of Occam’s razor: insisting everything be maximally simple? It’s maximally simple when A and B correlate to infer that one of them causes the other (instead of postulating a C common cause); it’s maximally simple to explain inexplicable events as due to a supernatural agent (instead of postulating a universe of complex underlying processes whose full explication fills up libraries without end and are still poorly understood).
That sounds more like a poor understanding of Occam’s razor. Complex ontologically basic processes are not simpler than a handful of strict mathematical rules.
Of course it’s (normatively) wrong. But if that particular error is what’s going on in peoples’ heads, it’ll manifest as a different pattern of errors (and hence useful interventions) than an availability bias: availability bias will be cured by forcing generation of scenarios, but a preference for oversimplification will cause the error even if you lay out the various scenarios on a silver platter, because the subject will still prefer the maximally simple version where A->B rather than A<-C->B.
That is another aspect, I think, but I’d probably consider the underlying drive to be not the desire for simplicity but the desire for the world to make sense. To support this, let me point out another universal human tendency: the yearning for stories, narratives that impose some structure on the surrounding reality (and these maps do not seek to match the territory as well as they can) and so provide the illusion of understanding and control.
In other words, humans are driven to always have some understandable map of the world around them: any map, even a pretty bad one. The lack of a map, the lack of understanding (even a false understanding) of what’s happening, is well known to lead to severe stress and general unhappiness.
Seems to me like a special case of privileging the hypothesis?
You’re missing a 4th possibility. A & B are not meaningfully linked. This is very important when dealing with large sets of variables. Your measure of correlation will have a certain percentage of false positives, and discounting the possibility of false positives is important. If the probability of false positives is 1/X you should expect one false correlation for every X comparisons.
XKCD provides an excellent example: jelly beans
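To put a rough number on the false-positive point, here is a quick simulation sketch (the sample size, number of variables, and 0.05 threshold are arbitrary choices): with 50 mutually independent variables, about 5% of the ~1200 pairwise tests come out “significant” anyway.

```python
# Many pairwise tests on genuinely unrelated variables still yield ~5%
# "significant" correlations at p < 0.05.  Sizes are arbitrary choices.
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(size=(100, 50))       # 100 observations of 50 unrelated variables

pvals = []
for i, j in combinations(range(data.shape[1]), 2):
    r, p = stats.pearsonr(data[:, i], data[:, j])
    pvals.append(p)

hits = sum(p < 0.05 for p in pvals)
print(f"{hits} 'significant' correlations out of {len(pvals)} pairs "
      f"({hits / len(pvals):.1%}) among variables with no real relationships")
```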
I’m pointing out that your list isn’t complete, and not considering this possibility when we see a correlation is irresponsible. There are a lot of apparent correlations, and your three possibilities provide no means to reject false positives.
You are fighting the hypothetical. In the least convenient possible world where no dataset is smaller than a petabyte and no one has ever heard of sampling error, would you magically be able to spin the straw of correlation into the gold of causation? No. Why not? That’s what I am discussing here.
I suggest you move that point closer to the list of 3 possibilities—I too read that list and immediately thought, ”...and also coincidence.”
The quote you posted above (“And we can’t explain away...”) is an unsupported assertion—a correct one in my opinion, but it really doesn’t do enough to direct attention away from false positive correlations. I suggest that you make it explicit in the OP that you’re talking about a hypothetical in which random coincidences are excluded from the start. (Upvoted the OP FWIW.)
(Also, if I understand it correctly, Ramsey theory suggests that coincidences are inevitable even in the absence of sampling error.)
I agree with gwern’s decision to separate statistical issues from issues which arise even with infinite samples. Statistical issues are also extremely important, and deserve careful study, however we should divide and conquer complicated subjects.
I also agree—I’m recommending that he make that split clearer to the reader by addressing it up front.
I see. I really didn’t expect this to be such an issue and come up in both the open thread & Main… I’ve tried rewriting the introduction a bit. If people still insist on getting snagged on that, I give up.
It ends with “etc.” for Pete’s sake!
...no it doesn’t?
A critical mistake in the lead analysis is the false assumption that where there is a causal relation between two variables, they will be correlated. This ignores that causes often cancel out. (Of course, not perfectly, but enough to make raw correlation a generally poor guide to causality.)
I think you have a fundamentally mistaken epistemology, gwern: you don’t see that correlations only support causality when they are predicted by a causal theory.
If two variables are d-separated given a third, there is no partial correlation between the two, and the converse holds for almost all probability distributions consistent with the causal model. This is a theorem (Pearl 1.2.4). It’s true that not all causal effects are identifiable from statistical data, but there are general rules for determining which effects in a model are identifiable (e.g., front-door and back-door criteria).
Therefore I don’t see how something like “causes often cancel out” could be true. Do you have any mathematical evidence?
I see nothing of this “fundamentally mistaken epistemology” that you claim to see in gwern’s essay.
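For the linear-Gaussian case, the d-separation claim above is easy to check numerically. A minimal sketch, assuming the chain X → Z → Y with arbitrary coefficients: X and Y are d-separated given Z, and the partial correlation given Z vanishes while the marginal correlation does not.

```python
# d-separation in a chain X -> Z -> Y: partial correlation given Z is ~0.
# Coefficients are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
n = 500_000
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)     # X -> Z
y = 0.8 * z + rng.normal(size=n)     # Z -> Y

def partial_corr(a, b, given):
    """Correlate the residuals of a and b after regressing each on `given`."""
    def residual(v):
        beta = np.cov(v, given, ddof=0)[0, 1] / np.var(given)
        return v - beta * given
    return np.corrcoef(residual(a), residual(b))[0, 1]

print("corr(X, Y)     =", round(float(np.corrcoef(x, y)[0, 1]), 3))  # clearly nonzero
print("corr(X, Y | Z) =", round(float(partial_corr(x, y, z)), 3))    # ~0: X, Y d-separated by Z
```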
Causes do cancel out in some structures, and Nature does not select randomly (e.g. evolution might select for cancellation for homeostasis reasons). So the argument that most models are faithful is not always convincing.
This is a real issue, a causal version of a related issue in statistics where two types of statistical dependence cancel out such that there is a conditional independence in the data, but underlying phenomena are related.
I don’t think gwern has a mistaken epistemology, however, because this issue exists. The issue just makes causal (and statistical) inference harder.
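For what it’s worth, here is a toy numerical sketch of the kind of cancellation described above (the structural equations and coefficients are made-up illustrative choices): a genuine effect of X on Y is exactly offset by confounding, so the observed correlation is near zero even though intervening on X clearly changes Y.

```python
# Cancellation / faithfulness violation: a real causal effect of X on Y is
# hidden by confounding that exactly offsets it.  All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(7)
n = 500_000

def generate(do_x=None):
    c = rng.normal(size=n)                         # confounder: C -> X, C -> Y
    x = c + rng.normal(size=n) if do_x is None else np.full(n, float(do_x))
    y = 0.5 * x - 1.0 * c + rng.normal(size=n)     # real effect X -> Y, tuned to offset the confounding
    return x, y

x, y = generate()
print("observed corr(X, Y):", round(float(np.corrcoef(x, y)[0, 1]), 3))     # ~0
print("E[Y | do(X=0)]:     ", round(float(generate(do_x=0)[1].mean()), 3))  # ~0
print("E[Y | do(X=1)]:     ", round(float(generate(do_x=1)[1].mean()), 3))  # ~0.5
```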
I agree completely.