Awesome idea.
As far as I understand it, if variables A and B are correlated, then we can be pretty damn sure that either:
A causes B
B causes A
there’s a third variable affecting both A and B.
(Am I right about this or is this an oversimplification?)
A good way to grab attention might be to deny a commonly believed fact in a way that promises intelligent elaboration. So the website could start with a huge ‘Correlation does not imply causation’ banner and then go like ‘well, actually, it kind of does’. And then explain how going from not knowing anything at all to knowing that one of three causal hypotheses is correct is pretty damn informative even if we don’t immediately know which of the hypotheses is correct.
Then it would probably be useful to go all Bayesian and talk about priors, Ockham’s razor and how it’s a rare situation where we cannot distinguish between hypotheses at all. A good example might be to tell the story of how R. A. Fisher used the ‘correlation does not imply causation’ platitude to shoot down research connecting smoking to lung cancer and explain that it should have been clear that the hypothesis ‘smoking causes cancer’ was much more reasonable at that time than the hypothesis ‘there’s a common factor causing both smoking and cancer’. (On the other hand, this could turn political. I don’t know whether the smoking and lung cancer issue is still contested.)
There is in fact a d) A and not-B both can cause some condition C that defines our sample.
Example: Sexy people are more likely to be hired as actors. Good actors are also more likely to be hired as actors. So if we look at “people who are actors,” then we’ll get people who are sexy but can’t really act, people who are sexy and can act, and people who can act and aren’t really sexy. If sexiness and acting ability are independent, these three groups will be about equally full.
Thus if we look at actors in general in our simple model, 2⁄3 of them will be sexy and 2⁄3 of them will be good actors. But of the ones who are sexy, only 1⁄2 will be good actors. So being sexy is correlated with being a bad actor! Not because sexiness rots your brain (a), or because acting well makes you ugly (b), and not because acting classes cause both good acting and ugliness, or diet pills cause both beauty and bad acting (c). Instead, it’s just because how we picked actors made sexiness and acting ability “compete for the same niche.”
Similar examples would be sports and academics in college, different sorts of skills in people promoted in the workplace, UI design versus functionality in popular programs, and so on and so on.
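For concreteness, here is a quick Python sketch of the selection effect (the 50/50 trait frequencies and the "hire anyone with at least one trait" rule are simplifying assumptions of mine, chosen to match the equal-groups setup above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Two independent binary traits, each present in half the population.
sexy = rng.random(n) < 0.5
good_actor = rng.random(n) < 0.5

# Toy selection rule: you get hired if you have at least one of the two
# traits. Looking only at actors means conditioning on this collider.
actor = sexy | good_actor

print(sexy[actor].mean())               # ~2/3 of actors are sexy
print(good_actor[actor].mean())         # ~2/3 of actors act well
print(good_actor[actor & sexy].mean())  # ~1/2 of sexy actors act well
print(np.corrcoef(sexy[actor].astype(float),
                  good_actor[actor].astype(float))[0, 1])  # ~-0.5
```

Under these assumptions the within-sample correlation comes out near -0.5, even though the two traits are exactly independent in the population at large.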
I feel like this example should go on the doesnotimply website.
If you are familiar with d-separation (http://en.wikipedia.org/wiki/D-separation#d-separation), we have:
if A is dependent on B, and there’s some unobserved C involved, then:
(1) A ← C → B, or
(2) A → C → B, or
(3) A ← C ← B
(this is Reichenbach’s common cause principle: http://plato.stanford.edu/entries/physics-Rpcc/)
or
(4) A → C ← B
if C or its effect attains a particular (not necessarily recorded) value. Statisticians know this as Berkson’s bias, which is a form of selection bias. In AI, this is known as “explaining away.” Manfred’s excellent example falls into category (4), with C observed to equal “hired as actor.”
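A toy simulation (my own, using Gaussian variables and a crude "slice a narrow band" stand-in for conditioning) makes the statistical signatures of these structures concrete; (3) behaves exactly like (2) with the roles of A and B swapped:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

# (1) Fork A <- C -> B: A and B correlate, but the correlation vanishes
# once C is (approximately) held fixed.
C = rng.normal(size=n)
A = C + rng.normal(size=n)
B = C + rng.normal(size=n)
band = np.abs(C) < 0.1
print(corr(A, B), corr(A[band], B[band]))  # ~0.5, then ~0

# (2) Chain A -> C -> B: the same signature as the fork.
A = rng.normal(size=n)
C = A + rng.normal(size=n)
B = C + rng.normal(size=n)
band = np.abs(C) < 0.1
print(corr(A, B), corr(A[band], B[band]))  # ~0.58, then ~0

# (4) Collider A -> C <- B: no marginal correlation, but selecting on C
# creates one (the "explaining away" / Berkson effect).
A = rng.normal(size=n)
B = rng.normal(size=n)
C = A + B + rng.normal(size=n)
selected = C > 1.0
print(corr(A, B), corr(A[selected], B[selected]))  # ~0, then negative
```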
Beware: d-separation applies both to causal graphical models and to Bayesian networks (which are statistical, not causal, models). The meaning of the arrows differs between the two kinds of models. This is actually a fairly subtle issue.
Odd—I always felt like d-separation was the same thing on causal diagrams and on Bayes networks. Although I also understood a Bayes network as being a model of the causal directions in a situation, so perhaps that’s why.
Manfred’s excellent example needs equally excellent counterparts for other possibilities.
Sorry for not being clear. The d-separation criterion is the same in both Bayesian networks and causal diagrams, but its meaning is not the same. This is because an arrow A → B in a causal diagram means (loosely) that A is a direct cause of B at the level of granularity of the model, while an arrow A → B in a Bayesian network has a meaning that is more complicated to explain, having to do with the Markov factorization and conditional independence. D-separation talks about arrows in both cases, but asserts different things due to a difference in the meaning of those arrows.
A Bayesian network model is just a statistical model (a set of joint distributions) associated with a directed acyclic graph. Specifically, it’s all distributions p(x_1, …, x_k) that factorize as a product of terms of the form p(x_i | parents(x_i)). Nothing more, nothing less. Nothing about causality in that definition.
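A minimal sketch of that point (toy numbers of my own): the same two-variable joint factorizes along A → B and along B → A, so the factorization alone cannot tell you which way causality runs.

```python
import numpy as np

# p(a, b): rows index a, columns index b; an arbitrary toy joint.
p = np.array([[0.3, 0.1],
              [0.2, 0.4]])

p_a = p.sum(axis=1)                 # p(a)
p_b_given_a = p / p_a[:, None]      # p(b | a)

p_b = p.sum(axis=0)                 # p(b)
p_a_given_b = p / p_b[None, :]      # p(a | b)

# Both products reconstruct the identical joint, so the DAGs A -> B and
# B -> A are equally valid Bayesian networks for this distribution.
print(np.allclose(p_a[:, None] * p_b_given_a, p))  # True
print(np.allclose(p_b[None, :] * p_a_given_b, p))  # True
```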
I think examples for (1),(2),(3) are simpler than Manfred’s Berkson’s bias example.
(1) A ← C → B
Most clearly non-causal associations go here: “shoe size correlates with IQ” and its kin.
(2) A → C → B, and (3) A ← C ← B
Classic scientific triumphs go here: “smoking causes cancer.” Of note here is that if we can find an observable unconfounded C that intercepts all/most of the causal pathway, this is extremely valuable for estimating effects. If you can design an experiment with such a C, you don’t even have to randomize A.
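This is essentially Pearl’s front-door criterion. A toy simulation of my own (the graph and all numbers are illustrative assumptions, not anything from the thread): U confounds A and B, the effect of A on B runs entirely through an observed mediator C, and the front-door formula recovers the interventional effect that the naive observational contrast gets wrong.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Structure: U -> A, U -> B, A -> C -> B (C intercepts the whole pathway).
U = rng.random(n) < 0.5
A = rng.random(n) < np.where(U, 0.8, 0.2)
C = rng.random(n) < np.where(A, 0.8, 0.1)
B = rng.random(n) < 0.1 + 0.6 * C + 0.2 * U

# Naive observational contrast: biased upward by the confounder U.
naive = B[A].mean() - B[~A].mean()

# Front-door adjustment:
#   p(B=1 | do(A=a)) = sum_c p(c | a) * sum_a' p(B=1 | c, a') * p(a')
def p_b_do_a(a):
    total = 0.0
    for c in (False, True):
        p_c_given_a = (C[A == a] == c).mean()
        inner = sum(B[(C == c) & (A == ap)].mean() * (A == ap).mean()
                    for ap in (False, True))
        total += p_c_given_a * inner
    return total

frontdoor = p_b_do_a(True) - p_b_do_a(False)

# Ground truth, obtained by actually intervening on A in the simulation.
C1 = rng.random(n) < 0.8   # C under do(A=1)
C0 = rng.random(n) < 0.1   # C under do(A=0)
truth = (0.1 + 0.6 * C1 + 0.2 * U).mean() - (0.1 + 0.6 * C0 + 0.2 * U).mean()

print(naive, frontdoor, truth)  # ~0.54 (biased), ~0.42, ~0.42
```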
That’s known as Berkson’s paradox.
I first heard of this idea a few months ago in a blog post at The Atlantic.
Aha, yes—and I think I in turn got the link from Ben Goldacre. But the reason I was quickly able to enumerate this as a separate kind of correlation is that the causal graph is different, which would be Judea Pearl.
Yup. I’m reading the link from this post and just got to the discussion of Berkson’s paradox, which seems to be the same effect.
What do you mean by “equally full”?
I mean “I’m about to pretend that ‘sexy’ and ‘good actor’ are binary variables centered to make the math super easy.” If you would like less pretending, read the Atlantic article linked by a thoughtful replier, since the author draws the nice graph to prove the general case.
I wouldn’t like less pretending, and ‘sexy’/‘good actor’ being binary variables is fine with me (and I understand your comment overall), but I still don’t know what it means that the groups are equally full. (Equal size? That doesn’t follow from independence.)
Right, so I make the math-light but false assumption that casting directors will take above-average applicants, and also that you aren’t more likely to eventually become an actor if you’re sexy and can act well.
If you mean “above median”, I see.
There’s also e): A causes B within our sample, but A does not cause B generally, or in the sense that we care about.
For example, suppose a teacher gives out a gold star whenever a pupil does a good piece of work, and this causes the pupil to work harder. Suppose also that this effect is greatest on mediocre pupils and least on the best pupils—but the best pupils get most of the gold stars, naturally.
Now suppose an educational researcher observes the class, and notes the correlation between receiving a gold star and increased effort. This is genuine causation. He then concludes that the teacher should give out more gold stars, regardless of whether the pupil does a good piece of work or not, and focus the stars on mediocre pupils. This change made, the gold stars no longer cause increased effort. The causation disappears! Changing the way the teacher hands out the gold stars changes the relationship between gold stars and effort. So although there was genuine causation in the original sample, there is no general causation, or causation in the sense we care about; we can’t treat the gold stars as an exogenous variable.
See also the Lucas Critique.
That’s because you have cause and effect reversed: The extra effort causes the gold stars, not the other way around.
No, the gold stars cause extra effort after they are given out. This is part of the hypothetical.
The pupils work harder after they are given a gold star because they see their good work is appreciated. But if the gold stars are given out willy-nilly, then the pupils no longer feel proud to get one, and so the stars lose their ability to make pupils work harder.
As Robert Lucas would put it, the relationship is not robust to changes in the policy regime.
If the gold stars are what is causing the hard work in the hypothetical, then the hypothetical policy of giving out more gold stars would work. If giving out more gold stars doesn’t improve work, then distribute the gold stars to the students who do the least work: the ones with the most room to improve.
If, on the other hand, gold stars are a proxy for recognition, then students who want recognition have an extra incentive to work hard. Giving out more gold stars dilutes the effect, and distributing them according to some criterion other than ‘who put in hard work on this assignment’ also reduces the effect. The cause of the extra hard work isn’t the gold stars, but the method by which gold stars are distributed.