Understanding Simpson’s Paradox
An article by Judea Pearl, available here. It’s quick at 8 pages, and worth reading if you enjoy statistics (though I think people who are already familiar with the math of causality[1] will get more out of it than others[2]). I’ll talk here about the part that I think is generally interesting:
Any claim to a resolution of a paradox, especially one that has resisted a century of attempted resolution, must meet certain criteria. First and foremost, the solution must explain why people consider the phenomenon surprising or unbelievable. Second, the solution must identify the class of scenarios in which the paradox may surface, and distinguish it from scenarios where it will surely not surface. Finally, in those scenarios where the paradox leads to indecision, we must identify the correct answer, explain the features of the scenario that lead to that choice, and prove mathematically that the answer chosen is indeed correct. The next three subsections will describe how these three requirements are met in the case of Simpson’s paradox and, naturally, will proceed to convince readers that the paradox deserves the title “resolved.”
I’ve never really liked the name “paradox,” because what it seems to mean is “unintuitive phenomenon.” (Wikipedia puts it as “something which seems false and yet might be true.”) The trouble is that “unintuitive” is a two-place word, and it makes sense to think like reality, so that true things seem true to you, instead of still seeming false. (For example, when I first learned about Zeno’s Paradox, I already knew calculus, and so Zeno’s position was the one that seemed confusing and false.)
What I like most about Pearl’s article is that it explicitly recognizes the importance of fully dissolving the paradox,[3] and seems to do so. Simpson’s Paradox isn’t an unsolvable problem in statistics; it’s a straightforward reversal effect, but only if you use the language of causality.
1. My review of Causality gives a taste of what it would look like to be familiar with the math, but you’d need to actually read the book to pick it up. The Highly Advanced Epistemology 101 for Beginners sequence is relevant, and contains Eliezer’s attempt to explain the basics of causality in Causal Diagrams and Causal Models.
2. Pearl discusses how you would go about using simulations to show that the do-calculus gives you the right result, but leaves it as an exercise for the reader.
3. How An Algorithm Feels From Inside is probably a better place to start than Dissolving the Question, and I can’t help but echo a question from it: “So what kind of math design corresponds to [Simpson’s Paradox]?”
See also: bentarm’s explanation of Simpson’s Paradox.
I have a question that I can’t work out. From Pearl’s Causality book (the 2000 version with the excellent commentary in the back), I read on page 356:
My problem is that I cannot imagine a world in which men earn more than equally qualified women, men are more qualified than equally paid women, and more qualified men (respectively, women) are paid more than less qualified men (respectively, women). There does not appear to be such a set of points in the space (Wages) x (Qualifications) x (Genders) where all of these conditions hold true. Since Pearl asserts the first two, do I have to get rid of the idea that more qualifications lead to more pay? I can’t see any other way out of the bind.
(My reasoning for why this appears to be impossible: start with the assumption of the first two conditions (i.e. Pearl’s assertions). Consider a man of some qualifications and pay. A woman A who is as qualified as him earns less. A woman B who earns as much as him is more qualified. But the slope of the qualifications-wages line between woman A and woman B goes the wrong way for qualifications to be positively correlated with wages: the less qualified woman earns more! So if this is possible, there’s something quite unintuitive going on with the distributions.)
Let’s take a world with 10 people and 4 jobs:
Engineer (high-education, high-pay): 2 men and 1 woman
Teacher (high-education, low-pay): 1 man and 1 woman
Plumber (low-education, high-pay): 1 man and 1 woman
Cleaner (low-education, low-pay): 1 man and 2 women
If you control for education:
50% of uneducated men have high-paying jobs, versus 33% of uneducated women
66% of educated men have high-paying jobs, versus 50% of educated women
… and if you control for pay:
66% of high-salary men are educated, versus 50% of high-salary women
50% of low-salary men are educated, versus 33% of low-salary women
You can also check that for both men and women, income and education are correlated.
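Here’s a quick sanity check of those percentages (a throwaway Python sketch; the tuples and the helper are mine and just encode the table above):

```python
# (education, pay, gender, count) for the 10-person population above
population = [
    ("high-ed", "high-pay", "man",   2),  # engineers
    ("high-ed", "high-pay", "woman", 1),
    ("high-ed", "low-pay",  "man",   1),  # teachers
    ("high-ed", "low-pay",  "woman", 1),
    ("low-ed",  "high-pay", "man",   1),  # plumbers
    ("low-ed",  "high-pay", "woman", 1),
    ("low-ed",  "low-pay",  "man",   1),  # cleaners
    ("low-ed",  "low-pay",  "woman", 2),
]

def share(in_group, counted):
    """Fraction of people satisfying `in_group` who also satisfy `counted`."""
    total = sum(n for ed, pay, sex, n in population if in_group(ed, pay, sex))
    hits = sum(n for ed, pay, sex, n in population
               if in_group(ed, pay, sex) and counted(ed, pay, sex))
    return hits / total

for ed in ("low-ed", "high-ed"):
    for sex in ("man", "woman"):
        frac = share(lambda e, p, s, ed=ed, sex=sex: e == ed and s == sex,
                     lambda e, p, s: p == "high-pay")
        print(f"{ed} {sex}: {frac:.0%} high-pay")

for pay in ("low-pay", "high-pay"):
    for sex in ("man", "woman"):
        frac = share(lambda e, p, s, pay=pay, sex=sex: p == pay and s == sex,
                     lambda e, p, s: e == "high-ed")
        print(f"{pay} {sex}: {frac:.0%} educated")
```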
Small numbers that illustrate the case. Perfect!
I believe so, with the caveat that this could be a reversal effect. That is, qualifications and pay may be positively correlated for the whole group because men have more of both than women, while for each subgroup the correlations are negative.
Consider the following situation:
Men have 60 points to spend at character creation. Each point can either be used on a year of schooling, or a dollar of salary, with a minimum of 10 in each.
Women have 30 points to spend at character creation. Each point can either be used on a year of schooling, or a dollar of salary, with a minimum of 10 in each.
Now Bob says, “Look! If we look at groups determined by salary, each man is more qualified than women in his cohort, by thirty years of schooling.” Barbara says, “Look! If we look at groups determined by schooling, each woman earns less than men in her cohort, by thirty dollars.”
If most people choose to spend their points equally, then the population will be dominated by the points (15,15) and (30,30), and so the Association of Higher Education will say “Look! Schooling and salary are positively correlated.”
The causal diagram in this situation is clear, though: it’s being male that leads to more points, while the direct effect of schooling on salary is negative because those two come from the same pool of points.
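A minimal simulation of this toy model, for anyone who wants to see the numbers (everything here is invented for illustration; the “most people split evenly” assumption becomes a normal draw around the midpoint):

```python
import numpy as np

rng = np.random.default_rng(0)

def create_characters(n, total_points):
    """Split `total_points` between schooling and salary, minimum 10 each,
    with most people splitting roughly evenly."""
    schooling = np.clip(
        rng.normal(total_points / 2, 3, size=n).round(), 10, total_points - 10
    )
    salary = total_points - schooling
    return schooling, salary

men_school, men_salary = create_characters(1000, 60)
women_school, women_salary = create_characters(1000, 30)

school = np.concatenate([men_school, women_school])
salary = np.concatenate([men_salary, women_salary])

print("within men:  ", np.corrcoef(men_school, men_salary)[0, 1])      # -1 by construction
print("within women:", np.corrcoef(women_school, women_salary)[0, 1])  # -1 by construction
print("pooled:      ", np.corrcoef(school, salary)[0, 1])              # strongly positive
```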
That’s a great explanation, thanks for writing it! From now on, I will use your explanation instead of mine.
Thanks! (I am amused that the linked explanation includes evidence of Vaniver_2010 being confused by Simpson’s Paradox.)
Thanks for the insightful comment. I hadn’t considered that particular application of Simpson’s paradox. But really, I don’t think this is that likely, is it? I mean, you’re letting me keep one statement I like (“qualifications correlate with earnings in general”) but giving up two statements that I find likely: “qualifications correlate with earnings for males (resp. females)”.
This paper looks like it says that qualifications are correlated with earnings for each subgroup. See the tables on pages 21 and 22. I say “looks like” since I haven’t actually read it and just skipped to the tables. I hope to get a chance to look at it more in depth soon.
I think that particular reversal is probably unlikely in general, but I can think of several plausible cases when it would exist.
Suppose that IQ positively impacts both education and income. But education has a negative effect on income, because the more educated someone is, the more they will choose to work on abstract tasks which don’t pay as highly. (A salesman earns more than a mathematician, say, and the primary function of education is to convince some people that mathematicians are higher status than salesmen.) It looks like the impact of education on income is positive, because of the effect of IQ. (This is basically the same as the reversal effect we discussed, except swapping out sex for IQ.)
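A quick simulation of that IQ story (all coefficients are invented; the point is just that the naive slope is positive while the controlled slope is negative):

```python
import numpy as np

rng = np.random.default_rng(5)
iq = rng.normal(0, 1, 5000)
education = iq + rng.normal(0, 0.5, 5000)                  # IQ raises education
income = iq - 0.5 * education + rng.normal(0, 0.5, 5000)   # education itself lowers income

# Naive regression: education looks good for income, because IQ is hidden.
print(np.polyfit(education, income, 1)[0])                 # positive

# Controlling for IQ recovers the negative direct effect of education.
X = np.column_stack([education, iq, np.ones_like(iq)])
print(np.linalg.lstsq(X, income, rcond=None)[0][0])        # about -0.5
```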
Suppose among workers in general, qualification has a positive impact on earnings. For one particular sex at one particular firm, the selection process might be such that qualification has a negative impact on earnings. For small firms in particular, this situation might be likely to arise by chance.
I guess the trick is that, if you’re using a standard least-squares fitting to find your regressions, the linear fit that you get by minimizing the sum of squared errors in one variable is not the same as the linear fit that you get by minimizing the sum of squared errors in the other variable. So as long as the true data isn’t a simple line, but rather a noisy distribution or a nonlinear relation, you can get different pairs of lines depending on which minimization problem you solve.
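Here’s a quick illustration with made-up data: fit y on x and x on y to the same noisy points and compare the implied slopes.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 500)
y = 0.6 * x + rng.normal(0, 1, 500)        # a noisy positive relationship

slope_y_on_x = np.polyfit(x, y, 1)[0]      # minimizes squared errors in y
slope_x_on_y = 1 / np.polyfit(y, x, 1)[0]  # minimizes squared errors in x, re-expressed as dy/dx

print(slope_y_on_x)   # roughly 0.6
print(slope_x_on_y)   # noticeably steeper, because corr(x, y) < 1
```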
This discussion of the problem is a little less hand-wavy than my guess above; it includes a (visual) example of the paradox and seems to agree that having noisy data is critical to the problem. Oh, and it seems to have been written by a LessWrong reader. I thought that lingo looked oddly familiar.
Thanks, that discussion’s examples were exactly what I was looking for!
I disagree with Yan’s discussion there on two (closely related) points.
First, I don’t think his claim B (that “more educated people get paid more”) is correct, because it’s not actually true for greens and blues separately. That is, it might be true that the total effect is that education leads to higher pay, but it’s not true that the direct effect is that education leads to higher pay, which is the same as my model in this comment. It looks to me like the direct effect of education on income is neutral or negative but the indirect effect (through color) is positive. (I have some training in estimating correlations from looking at graphs, but actually computing it would be better.)
Second, that suggests to me that this is a garden-variety reversal effect (i.e. Simpson’s Paradox), so I disagree with his claim that it differs in origin from Simpson’s Paradox.
The core of this disagreement is what the conditions are on the noise. I think that the noise needs to be negatively correlated (in two dimensions, the major axis of the ellipse runs northwest-southeast) to allow this effect (which means it’s just a reversal effect obscured by noise). If it’s possible to get this effect with uncorrelated noise (in 2d, the noise is circular), or positively correlated noise (the major axis of the ellipse runs northeast-southwest), then I’m wrong and this is its own effect, but I don’t see a way to do that.
[edit] Actually, now that I say this, I think I can construct a distribution that does this by hiding another reversal in the noise. Suppose you have a dense line of greens from (0,0) to (2,2), and a dense line of blues from (2,2) to (4,4). Then also add in a sparse scattering of both colors by picking a point on the line, generating a number from N(0,1), and then subtracting that from the x coordinate and adding it to the y coordinate. Each blue on the line will only be compared with greens that are off the line, which are either below or to the left of it. Each green on the line will only be compared with blues that are off the line, which will be either above or to the right of it.
But you can see here that the work making the ‘discrimination’ effect of reverse regression is still being done by the negatively correlated noise (‘preference’, say), not the positively correlated noise (‘merit’, say), even though we can construct a model where merit swamps preference for each group individually (so it appears the direct effect of education on pay is positive, for each group and subgroup but not for each level of merit).
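For concreteness, here’s a sketch of that construction (the sizes of the dense and sparse sets are my own choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def group(offset, n_dense=200, n_sparse=20):
    """Points on a unit-slope segment from (offset, offset) to (offset+2, offset+2),
    plus a sparse scatter pushed off the line along the NW-SE direction."""
    t = rng.uniform(0, 2, n_dense + n_sparse)
    eps = np.zeros(n_dense + n_sparse)
    eps[n_dense:] = rng.normal(0, 1, n_sparse)   # only the sparse points get noise
    x = offset + t - eps                         # subtract from x...
    y = offset + t + eps                         # ...add to y, so the noise runs NW-SE
    return x, y

gx, gy = group(0.0)   # greens: dense line from (0,0) to (2,2)
bx, by = group(2.0)   # blues: dense line from (2,2) to (4,4)

# Each group on its own shows a positive education-pay relationship,
print(np.corrcoef(gx, gy)[0, 1], np.corrcoef(bx, by)[0, 1])
# yet at a fixed education level the blues there tend to earn more, and at a
# fixed pay level the blues there tend to be more educated: the off-line points
# doing those comparisons are exactly the NW-SE noise.
```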
Your explicit example of a distribution is quite helpful and is exactly what I was looking for to help me visualize this system. Thank you.
You seem to be saying that Yan’s description of his model is incorrect. Why don’t you take his model, as described, and see if it really produces this effect?
For the first disagreement, that’s a disagreement about his commentary on his second figure. I don’t have the data to actually calculate the correlation there, but eyeballing it the groups look like they don’t have a positive relationship between education and income anywhere near that of the larger group.
The second disagreement is on interpretation. If you add noise in both dimensions to a multivariate Gaussian model with mean differences between groups, then that impacts any slice of the model (modified by the angle between the mean difference vector and the slice vector). If one subgroup is above and to the right of the other subgroup, that means it’s above for every vertical slice and to the right for every horizontal slice. (On northwest-southeast slices, there’s no mean difference between the distributions, just population size differences, and the mean difference is maximized on the northeast-southwest slice.)
The particular slicing used in this effect (looking at each vertical slice individually, and each horizontal slice individually) seems reasonable, except that in the presence of mean differences it behaves as a filter that preserves the NW-SE noise!
The grandparent was wrong before I edited it, where I speculated that the noise had to be negatively correlated. That’s the claim that the major axis of the covariance ellipse has to be oriented in a particular direction, but that was an overreach, as you see the reverse regression effect if there is any noise along the NW-SE axis. Take a look at Yan’s first figure: it has noise in both blues and greens, but it’s one-dimensional noise going NE-SW, and so we don’t see reverse regression.
My original thought (when I thought you might need the major axis to be NW-SE, rather than just needing nonzero noise along the NW-SE axis) was that this was just a reversal effect, with the noise providing the reversing factor. That’s still true, but I’m surprised at how benign the restrictions on the noise are.
That is, I disagree with Yan that this has a different origin than Simpson’s Paradox, but I agree with Yan that this is an important example of how pernicious reversal effects are, and that noise generates them by default, in some sense. I would demonstrate it with a multivariate Gaussian where the blue mean is [6 6], the green mean is [4 4], and the covariance matrix is [1 .5; .5 1], so that it’s obvious that the dominant relationship for each group is a positive relationship between education and income but the NW-SE relationship exists and these slices make it visible.
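A sketch of that demonstration (the means and covariance are the ones above; the sample sizes and slice width are my own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
cov = [[1.0, 0.5], [0.5, 1.0]]

blues = rng.multivariate_normal([6, 6], cov, 5000)   # columns: education, income
greens = rng.multivariate_normal([4, 4], cov, 5000)

# Within each group, education and income are clearly positively related.
print(np.corrcoef(blues.T)[0, 1], np.corrcoef(greens.T)[0, 1])   # both around 0.5

# Vertical slice: hold education fixed near 5 and compare income.
edu = 5.0
b_inc = blues[np.abs(blues[:, 0] - edu) < 0.1, 1]
g_inc = greens[np.abs(greens[:, 0] - edu) < 0.1, 1]
print(b_inc.mean(), g_inc.mean())   # blues earn more at the same education

# Horizontal slice: hold income fixed near 5 and compare education.
inc = 5.0
b_edu = blues[np.abs(blues[:, 1] - inc) < 0.1, 0]
g_edu = greens[np.abs(greens[:, 1] - inc) < 0.1, 0]
print(b_edu.mean(), g_edu.mean())   # blues are also more educated at the same income
```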
Hi Vaniver! =D
On the commentary: your eyeballing seems good, but I don’t think I ever said anything about relative comparisons between correlation coefficients (I only claimed that the overall correlation is positive). As you observed, I could easily make all 3 correlations (blue-only, green-only, or blue+green) positive. I don’t have any interesting things to say about their relative degrees.
I don’t quite see the difference in interpretation from this writing. I agree with basically all the stuff you’ve written? The fact that the slicing “behaves as a filter”, if I interpret it correctly, is exactly the problem here.
I don’t know what “have a different origin than Simpson’s paradox” means exactly, but here are a few ways they differ and why I say they are “different”:
- a fundamental assumption in Simpson’s paradox is that there’s some imbalance in the denominators; in your 2x2x2 matrix you can’t scale the numbers arbitrarily, and almost every example you can construct relies on (to use the familiar batting-averages example) the fact that the denominators (row sums) are different (a small numerical sketch follows below).
- the direct cause of the reversal effect here is, as you said, the noise; I don’t think Simpson’s paradox has anything to do with noise.
Idea: my steel-man version of your argument is that reversal effects arise when you have inhomogeneous data, and that this is definitely the more general problem common to both situations. In that case I agree. (This is how I teach this class at SPARC, at least.)
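To make the denominator point concrete, here’s a tiny batting-averages sketch (all counts invented):

```python
# Invented hit/at-bat counts. Player A has the better average in each season,
# but most of A's at-bats fall in the weak season (and vice versa for B).
records = {
    "A": {"season 1": (4, 10),   "season 2": (10, 100)},
    "B": {"season 1": (30, 100), "season 2": (1, 20)},
}

for player, seasons in records.items():
    for season, (hits, at_bats) in seasons.items():
        print(player, season, round(hits / at_bats, 3))
    hits = sum(h for h, _ in seasons.values())
    at_bats = sum(ab for _, ab in seasons.values())
    print(player, "overall", round(hits / at_bats, 3))

# A hits .400 and .100, .127 overall; B hits .300 and .050, .258 overall.
# Force every denominator to be equal and the reversal becomes impossible:
# the overall average is then just the mean of the per-season averages.
```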
The main line I’m thinking of is:
I don’t think this story quite captures the data, because I can construct a model where both of these are true but we don’t get this effect. If you have the same link between income and education for each group conditioned on knowing group membership (and a net positive relationship without knowing group membership), but the blue group mean is only to the right of (i.e. more educated than) the green group mean, then you don’t have this effect, because equal-education lines don’t have blues earning more than greens (they earn less; this is a straightforward ‘discrimination against blues’ story).
I would use the language of B to mean “in the three-node model which has color, education, and income, the direct effect of education on income is positive,” which does not appear to be true in the graphs, which look like they could be generated from an E <- C -> I model. While it could also be used to mean “in the two-node model which has education and income, the direct effect of education on income is positive,” that seems unnatural in a case where you know that the link from E to I is spurious (that is, it flows backwards along the causal path from C to E; changing your level of education can’t change your color). But this could just be me expecting an unreasonable level of precision from your wording, since the straightforward interpretation, though unnatural, does fit the data (although I think it reduces the strength of the “this doesn’t show discrimination” claim, because it does show that what looked like a virtuous education-income link is now a questionable color-income link).
It’s very possible I’ve imagined the difference / misunderstood what you’ve written. My appreciation of the filtering effect of the slices is also very recent, and I may think it’s more important as I think about it more.
It seems that I’m quick to jump to a graphical model with nodes that captures the effects between these groups, and to want to keep direct, indirect, and total effects separate. I think that’s why I see the fundamental broad mechanism here as a reversal effect: if you learn about a node C that was previously hidden, the direct path from node A to node B can reverse sign if the indirect path from A through C to B takes over the strength of the old connection from A to B. (This has requirements on C, which I think matches up with your statement about inhomogeneous data.)
In this view, noise is just a very convenient way to add new nodes which can cause these reversals, especially when the noise is multidimensional. So when I look at Simpson’s paradox and RRE (the reverse regression effect), the underlying mechanism I see is that there’s a reversal effect going on in each, and so they look fundamentally similar to me. I got the impression from your post that you think there’s a fundamental difference between them, and I don’t see that difference, but I might have misread you, or overestimated what you think the size of the difference is.
I’m glad it was helpful. =)
Technically speaking, if both your variables (x AND y) have errors in them, the ordinary least-squares regression is the wrong methodology to use. See http://en.wikipedia.org/wiki/Errors-in-variables_models
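A minimal illustration of the difference (this is orthogonal regression, the equal-error-variance special case, done by hand with numpy, not the full errors-in-variables machinery):

```python
import numpy as np

rng = np.random.default_rng(4)
true_x = rng.normal(0, 1, 2000)
x = true_x + rng.normal(0, 0.8, 2000)   # measurement error in x
y = true_x + rng.normal(0, 0.8, 2000)   # true slope is 1, plus error in y

# OLS slope is attenuated toward zero by the error in x.
print(np.polyfit(x, y, 1)[0])           # well below 1

# Orthogonal (total least squares) fit: direction of the leading
# principal component of the centered data.
data = np.column_stack([x - x.mean(), y - y.mean()])
_, _, vt = np.linalg.svd(data, full_matrices=False)
print(vt[0, 1] / vt[0, 0])              # close to 1
```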
You have a typo here, I think. Suppose the man has qualification 2 and salary 2. A has qualification 2 and earns 1, but B has salary 2 and qualification 3. The line is positive. If B has salary 2 and qualification 1 (i.e. the man is more qualified, not the woman), then this matches Pearl’s description and the line is negative.
Sometime ago I set out to create the simplest possible explanation of Simpson’s paradox, without any numbers at all. This was the result:
1) Imagine that most women who get some disease survive, while most men die.
2) Imagine that most women with the disease take a certain medicine, while most men don’t.
3) Imagine that the medicine has absolutely no effect. Women happen to buy it more because it’s marketed to women, and happen to die less for some unrelated physiological reason.
Now if you look at the population as a whole, you’ll see a strong correlation between taking the medicine and surviving. And even if the medicine has a weak negative effect, that won’t sway the correlation much.
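Here’s one concrete set of made-up numbers for that story, with the medicine having exactly zero effect within each sex by construction:

```python
# Invented counts: within each sex the medicine does nothing (same survival
# rate with or without it), but women both take it more and survive more.
rows = [
    # (sex, took_medicine, survived, count)
    ("woman", True,  True,  720), ("woman", True,  False,  80),
    ("woman", False, True,  180), ("woman", False, False,  20),
    ("man",   True,  True,   60), ("man",   True,  False, 140),
    ("man",   False, True,  240), ("man",   False, False, 560),
]

def survival(predicate):
    matching = [(s, c) for sex, took, s, c in rows if predicate(sex, took)]
    return sum(c for s, c in matching if s) / sum(c for _, c in matching)

print(survival(lambda sex, took: took))               # 78% survival among takers
print(survival(lambda sex, took: not took))           # 42% survival among non-takers
print(survival(lambda sex, took: sex == "woman"))     # 90% for women...
print(survival(lambda sex, took: sex == "man"))       # ...30% for men, and within
# each sex the rate is the same with or without the medicine (90% / 30%).
```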
Congratulations, now you have a general understanding of why slicing or merging data can introduce or remove meaningless correlations. You’ll never be able to read the press the same way again.