If variables A and B are correlated, then we can be pretty damn sure that either: a) A causes B, b) B causes A, or c) there's a third variable affecting both A and B.
There is in fact a d): A and not-B can both cause some condition C that defines our sample.
Example: Sexy people are more likely to be hired as actors. Good actors are also more likely to be hired as actors. So if we look at “people who are actors,” then we’ll get people who are sexy but can’t really act, people who are sexy and can act, and people who can act and aren’t really sexy. If sexiness and acting ability are independent, these three groups will be about equally full.
Thus if we look at actors in general in our simple model, 2⁄3 of them will be sexy and 2⁄3 of them will be good actors. But of the ones who are sexy, only 1⁄2 will be good actors. So being sexy is correlated with being a bad actor! Not because sexiness rots your brain (a), or because acting well makes you ugly (b), and not because acting classes cause both good acting and ugliness, or diet pills cause both beauty and bad acting (c). Instead, it's just because of how we picked actors: it made sexiness and acting ability “compete for the same niche.” (The quick simulation below checks these numbers.)
Similar examples would be sports and academics in college, different sorts of skills in people promoted in the workplace, UI design versus functionality in popular programs, and so on and so on.
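For the skeptical, here's a quick sanity check of those numbers (a minimal sketch, assuming the toy model above: sexiness and acting ability are independent fair coin flips, and anyone with at least one of the two traits gets hired):

```python
import random

random.seed(0)

# Toy model (an assumption, not real casting data): sexiness and acting
# ability are independent fair coins, and anyone with at least one of the
# two traits gets hired as an actor.
population = [(random.random() < 0.5, random.random() < 0.5)
              for _ in range(100_000)]
actors = [(sexy, good) for sexy, good in population if sexy or good]

p_good = sum(good for _, good in actors) / len(actors)
sexy_actors = [good for sexy, good in actors if sexy]
p_good_given_sexy = sum(sexy_actors) / len(sexy_actors)

print(f"P(good | actor)       ~ {p_good:.2f}")             # ~ 2/3
print(f"P(good | actor, sexy) ~ {p_good_given_sexy:.2f}")  # ~ 1/2
```

Among people who are actors, being sexy lowers the chance of being good from 2⁄3 to 1⁄2, even though the two traits are independent in the full population.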
I feel like this example should go on the doesnotimply website.
If you are familiar with d-separation (http://en.wikipedia.org/wiki/D-separation#d-separation), we have:
if A is dependent on B, and there’s some unobserved C involved, then:
(1) A ← C → B, or
(2) A → C → B, or
(3) A ← C ← B
(this is Reichenbach’s common cause principle: http://plato.stanford.edu/entries/physics-Rpcc/)
or
(4) A → C ← B
if C or its effect attains a particular (not necessarily recorded) value. Statisticians know this as Berkson’s bias, which is a form of selection bias. In AI, this is known as “explaining away.” Manfred’s excellent example falls into category (4), with C observed to equal “hired as actor.”
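Here is a minimal numeric sketch of “explaining away” in case (4), with made-up numbers (A and B are independent fair coins and C = A or B), computed by exact enumeration:

```python
from itertools import product

# Collider A -> C <- B: A and B are independent fair coins, C = A or B.
joint = {(a, b, int(a or b)): 0.25 for a, b in product((0, 1), repeat=2)}

def p(event, given=lambda a, b, c: True):
    num = sum(pr for (a, b, c), pr in joint.items()
              if event(a, b, c) and given(a, b, c))
    den = sum(pr for (a, b, c), pr in joint.items() if given(a, b, c))
    return num / den

print(p(lambda a, b, c: a == 1))                            # 0.5: the prior
print(p(lambda a, b, c: a == 1,
        lambda a, b, c: c == 1))                            # 2/3: C=1 raises belief in A
print(p(lambda a, b, c: a == 1,
        lambda a, b, c: c == 1 and b == 1))                 # 0.5: B=1 "explains away" C
```

Observing C = 1 raises the probability of A, but additionally observing B = 1 explains the evidence away and drops A back to its prior: conditioning on the collider made A and B dependent.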
Beware: d-separation applies both to causal graphical models and to Bayesian networks (which are statistical, not causal, models). The meaning of arrows is different in these two kinds of models. This is actually a fairly subtle issue.
Odd: I always felt like d-separation was the same thing on causal diagrams and on Bayes networks. Although I also understood a Bayes network as being a model of the causal directions in a situation, so perhaps that's why.
Manfred’s excellent example needs equally excellent counterparts for other possibilities.
Sorry for not being clear. The d-separation criterion is the same in both Bayesian networks and causal diagrams, but its meaning is not the same. This is because an arrow A → B in a causal diagram means (loosely) that A is a direct cause of B at the level of granularity of the model, while an arrow A → B in a Bayesian network has a meaning that is more complicated to explain, having to do with the Markov factorization and conditional independence. D-separation talks about arrows in both cases, but asserts different things due to a difference in the meaning of those arrows.
A Bayesian network model is just a statistical model (a set of joint distributions) associated with a directed acyclic graph. Specifically it’s all distributions p(x1, …, xk) that factorize as a product of terms of the form p(x_i | parents(x_i)). Nothing more, nothing less. Nothing about causality in that definition.
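For concreteness, here's that definition spelled out for a three-node chain (a minimal sketch with made-up numbers; note that nothing in it says the arrows are causal):

```python
# Bayesian network over the DAG X1 -> X2 -> X3: the set of all joints of the
# form p(x1, x2, x3) = p(x1) * p(x2 | x1) * p(x3 | x2). Numbers are made up.
p_x1 = {0: 0.6, 1: 0.4}
p_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}  # p_x2[x1][x2]
p_x3 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}  # p_x3[x2][x3]

def joint(x1, x2, x3):
    return p_x1[x1] * p_x2[x1][x2] * p_x3[x2][x3]

# The factors multiply out to a valid joint distribution (sums to 1):
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))
```

Whether X1 actually causes X2 is extra interpretation layered on top; the factorization itself is purely statistical.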
I think examples for (1), (2), and (3) are simpler than Manfred's Berkson's bias example.
(1) A ← C → B
Most clearly non-causal associations go here: “shoe size correlates with IQ” and its kin.
(2) A → C → B, and (3) A ← C ← B
Classic scientific triumphs go here: “smoking causes cancer.” Of note here is that if we can find an observable unconfounded C that intercepts all/most of the causal pathway, this is extremely valuable for estimating effects. If you can design an experiment with such a C, you don’t even have to randomize A.
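A minimal simulation of case (2) with made-up numbers, in the spirit of smoking → tar → cancer, shows why such a C is valuable: once you condition on the mediator, A carries no further information about B.

```python
import random

random.seed(0)

# Chain A -> C -> B with made-up probabilities; B depends only on C.
def sample():
    a = random.random() < 0.5                  # A: e.g. smoking
    c = random.random() < (0.8 if a else 0.1)  # C: e.g. tar deposits
    b = random.random() < (0.7 if c else 0.2)  # B: e.g. cancer
    return a, c, b

draws = [sample() for _ in range(200_000)]

def p_b(a_val, c_val):
    hits = [b for a, c, b in draws if a == a_val and c == c_val]
    return sum(hits) / len(hits)

# Conditioned on C, the dependence of B on A vanishes (d-separation):
print(p_b(True, True), p_b(False, True))    # both ~ 0.7
print(p_b(True, False), p_b(False, False))  # both ~ 0.2
```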
That’s known as Berkson’s paradox.
I first heard of this idea a few months ago in a blog post at The Atlantic.
Aha, yes, and I think I was in turn linked to it by Ben Goldacre. But the reason I was quickly able to enumerate this as a separate kind of correlation is that the causal graph is different, which would be Judea Pearl.
Yup. I’m reading the link from this post and just got to the discussion of Berkson’s paradox, which seems to be the same effect.
What do you mean by “equally full”?
I mean “I’m about to pretend that ‘sexy’ and ‘good actor’ are binary variables centered to make the math super easy.” If you would like less pretending, read the Atlantic article linked by a thoughtful replier, since the author draws the nice graph to prove the general case.
I wouldn’t like less pretending, and ‘sexy’/‘good actor’ being binary variables is fine with me (and I understand your comment overall), but I still don’t know what it means that the groups are equally full. (Equal size? That doesn’t follow from independence.)
Right, so I make the math-light but false assumption that casting directors will take above-average applicants, and also that you aren’t more likely to eventually become an actor if you’re sexy and can act well.
If you mean “above median”, I see.