I guess the trick is that, if you’re using standard least-squares fitting to find your regressions, the linear fit that you get by minimizing the sum of squared errors in one variable is not the same as the linear fit that you get by minimizing the sum of squared errors in the other variable. So as long as the true data isn’t a simple line, but rather a noisy distribution or a nonlinear relation, you can get different pairs of lines depending on which minimization problem you solve.
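A quick numerical sketch of that asymmetry (data made up for the demo): with scatter present, minimizing squared vertical errors and minimizing squared horizontal errors give visibly different slopes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = x + rng.normal(0, 2, 200)  # a noisy linear relation, true slope 1

# Minimize squared errors in y: the usual regression of y on x.
slope_yx = np.polyfit(x, y, 1)[0]

# Minimize squared errors in x: regress x on y, then re-express the
# result as a slope in the same (x, y) coordinates.
slope_xy = 1 / np.polyfit(y, x, 1)[0]
```

The two slopes disagree by a factor of 1/r² (the squared correlation), so they coincide only when the points lie exactly on a line.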
This discussion of the problem is a little less hand-wavy than my above guess, it includes a (visual) example of the paradox and it seems to agree that having noisy data is critical to the problem. Oh, and it seems to have been written by a LessWrong reader. I thought that lingo looked oddly familiar.
Thanks, that discussion’s examples were exactly what I was looking for!
I disagree with Yan’s discussion there on two (closely related) points.
First, I don’t think his claim B (that “more educated people get paid more”) is correct, because note that it’s not actually true for greens and blues separately. That is, it might be true that the total effect is that education leads to higher pay, but it’s not true that the direct effect is that education leads to higher pay, which is the same as my model in this comment. It looks to me like the direct effect of education on income is neutral or negative but the indirect effect (through color) is positive. (I have some training in estimating correlations from looking at graphs, but actually computing it would be better.)
Second, that suggests to me that this is a garden-variety reversal effect (i.e. Simpson’s Paradox), so I disagree with his claim that it differs in origin from Simpson’s Paradox.
The core of this disagreement is what the conditions are on the noise. I think that the noise needs to be negatively correlated (in two dimensions, the major axis of the ellipse runs northwest-southeast) to allow this effect (which means it’s just a reversal effect obscured by noise). If it’s possible to get this effect with uncorrelated noise (in 2d, the noise is circular), or positively correlated noise (the major axis of the ellipse runs northeast-southwest), then I’m wrong and this is its own effect, but I don’t see a way to do that.
[edit] Actually, now that I say this, I think I can construct a distribution that does this by hiding another reversal in the noise. Suppose you have a dense line of greens from (0,0) to (2,2), and a dense line of blues from (2,2) to (4,4). Then also add in a sparse scattering of both colors by picking a point on the line, generating a number from N(0,1), and then subtracting that from the x coordinate and adding it to the y coordinate. Each blue on the line will only be compared with greens that are off the line, which are either below or to the left of it. Each green on the line will only be compared with blues that are off the line, which will be either above or to the right of it.
But you can see here that the work making the ‘discrimination’ effect of reverse regression is still being done by the negatively correlated noise (‘preference’, say), not the positively correlated noise (‘merit’, say), even though we can construct a model where merit swamps preference for each group individually (so it appears the direct effect of education on pay is positive, for each group and subgroup but not for each level of merit).
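A minimal simulation of this construction (the sample sizes and seed are my own choices; everything else follows the description). Note that the offset (−n, +n) leaves x + y untouched, so all of the added scatter runs along the NW-SE axis:

```python
import numpy as np

rng = np.random.default_rng(0)

def group(lo, hi, n_dense, n_sparse, rng):
    # Dense points sitting exactly on the line y = x between lo and hi.
    t = rng.uniform(lo, hi, n_dense)
    dense = np.column_stack([t, t])
    # Sparse scatter: pick a point on the line, draw n ~ N(0,1),
    # subtract it from x and add it to y (a pure NW-SE displacement).
    t = rng.uniform(lo, hi, n_sparse)
    n = rng.normal(0, 1, n_sparse)
    sparse = np.column_stack([t - n, t + n])
    return np.vstack([dense, sparse])

greens = group(0, 2, 1000, 50, rng)  # line from (0,0) to (2,2)
blues = group(2, 4, 1000, 50, rng)   # line from (2,2) to (4,4)
```

Every green satisfies 0 ≤ x + y ≤ 4 and every blue 4 ≤ x + y ≤ 8, so the two clouds stay separated along the NE-SW direction while all of the noise is the negatively correlated kind.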
Your explicit example of a distribution is quite helpful and is exactly what I was looking for to help me visualize this system. Thank you.
You seem to be saying that Yan’s description of his model is incorrect. Why don’t you take his model, as described, and see if it really produces this effect?
For the first disagreement, that’s a disagreement about his commentary on his second figure. I don’t have the data to actually calculate the correlation there, but eyeballing it, the groups look like they don’t have a positive relationship between education and income anywhere near that of the larger group.
The second disagreement is on interpretation. If you add noise in both dimensions to a multivariate Gaussian model with mean differences between groups, then that impacts any slice of the model (modified by the angle between the mean difference vector and the slice vector). If one subgroup is above and to the right of the other subgroup, that means it’s above for every vertical slice and to the right for every horizontal slice. (On northwest-southeast slices, there’s no mean difference between the distributions, just population size differences, and the mean difference is maximized on the northeast-southwest slice.)
The particular slicing used in this effect (looking at each vertical slice individually, and each horizontal slice individually) seems reasonable, except that in the presence of mean differences it behaves as a filter that preserves the NW-SE noise!
The grandparent was wrong before I edited it, where I speculated that the noise had to be negatively correlated. That’s the claim that the major axis of the covariance ellipse has to be oriented in a particular direction, but that was an overreach: you see the reverse regression effect if there is any noise along the NW-SE axis. Take a look at Yan’s first figure: it has noise in both blues and greens, but it’s one-dimensional noise going NE-SW, and so we don’t see reverse regression.
My original thought (when I thought you might need the major axis to be NW-SE, rather than just the NW-SE axis to be nonzero) was that this was just a reversal effect, with the noise providing the reversing factor. That’s still true but I’m surprised at how benign the restrictions on the noise are.
That is, I disagree with Yan that this has a different origin than Simpson’s Paradox, but I agree with Yan that this is an important example of how pernicious reversal effects are, and that noise generates them by default, in some sense. I would demonstrate it with a multivariate Gaussian where the blue mean is [6 6], the green mean is [4 4], and the covariance matrix is [1 .5; .5 1], so that it’s obvious that the dominant relationship for each group is a positive relationship between education and income but the NW-SE relationship exists and these slices make it visible.
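That demonstration is easy to run (means and covariance as stated; the sample size, slice positions, and slice width are my own choices). Both groups have the same clearly positive within-group relationship, yet the slices show the reverse-regression pattern in both directions at once:

```python
import numpy as np

rng = np.random.default_rng(1)
cov = [[1.0, 0.5], [0.5, 1.0]]  # positive education-income link
blue = rng.multivariate_normal([6, 6], cov, 20000)
green = rng.multivariate_normal([4, 4], cov, 20000)

def slice_mean(pts, axis, value, width=0.2):
    """Mean of the other coordinate inside a thin slice along `axis`."""
    sel = np.abs(pts[:, axis] - value) < width
    return pts[sel, 1 - axis].mean()

# Vertical slice (equal education, x = 5): blues earn more (~5.5 vs ~4.5)...
blue_pay, green_pay = slice_mean(blue, 0, 5), slice_mean(green, 0, 5)

# ...horizontal slice (equal income, y = 5): blues are also more educated.
blue_edu, green_edu = slice_mean(blue, 1, 5), slice_mean(green, 1, 5)
```

So at equal education blues out-earn greens, and at equal income blues out-educate greens: both one-directional slicings pick up the NW-SE component of the noise even though the dominant within-group axis runs NE-SW.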
Hi Vaniver! =D
On the commentary: your eyeballing seems good, but I don’t think I ever said anything about relative comparisons between correlation coefficients (only that the overall correlation is positive). As you observed, I could easily make all 3 correlations (blue-only, green-only, or blue+green) positive. I don’t have any interesting things to say about their relative degrees.
I don’t quite see the difference in interpretation from this writing. I agree with basically all the stuff you’ve written? The fact that the slicing “behaves as a filter”, if I interpret it correctly, is exactly the problem here.
I don’t know what “have a different origin than Simpson’s paradox” means exactly, but here are a few ways they differ and why I say they are “different”:
a fundamental assumption in Simpson’s paradox is that there’s some imbalance in the denominators; in your 2x2x2 matrix you can’t scale the numbers arbitrarily; almost every example you can construct relies on (say, in the familiar batting-averages example) the denominators (row sums) being different.
the direct cause of the reversal effect is, as you said, the noise; I don’t think Simpson’s paradox has anything to do with noise.
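The denominator point is easy to exhibit with the batting-averages example; the numbers below are the oft-quoted 1995–96 Jeter/Justice seasons (treat them as illustrative):

```python
# (hits, at_bats) per season; the at-bat counts are the denominators.
jeter = {"1995": (12, 48), "1996": (183, 582)}
justice = {"1995": (104, 411), "1996": (45, 140)}

def avg(hits, at_bats):
    return hits / at_bats

# Justice wins each season separately, but the lopsided denominators
# flip the comparison once the seasons are pooled.
jeter_total = avg(*map(sum, zip(*jeter.values())))      # 195/630
justice_total = avg(*map(sum, zip(*justice.values())))  # 149/551
```

Give both players equal at-bats in each season (keeping the per-season rates) and the pooled comparison agrees with the per-season ones, which is the sense in which the paradox leans on the denominator imbalance.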
Idea: my steel-man version of your argument is that reversal effects arise when you have inhomogeneous data, and this is definitely the more general problem common to both situations. In that case I agree. (This is how I teach this class at SPARC, at least.)
I don’t think I ever said anything about relative comparisons between correlation coefficients.

The main line I’m thinking of is:

the data is telling a very simple story, which is that A) blue men are more educated and B) more educated people get paid more.
I don’t think this story quite captures the data, because I can construct a model where both of these are true but we don’t get this effect. If you have the same link between income and education for each group conditioned on knowing group membership (and a net positive relationship without knowing group membership), but you have the blue group mean only to the right of (i.e. more educated than) the green group mean, then you don’t have this effect because equal education lines don’t have blues earning more than greens (they earn less; this is a straightforward ‘discrimination against blues’ story).
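A sketch of that counter-model (parameters are my own illustrative choices): identical within-group covariance, but the blue mean shifted only along the education axis. The pooled education-income relationship stays positive, yet the equal-education comparison now runs against blues:

```python
import numpy as np

rng = np.random.default_rng(2)
cov = [[1.0, 0.5], [0.5, 1.0]]  # identical education-income link per group
blue = rng.multivariate_normal([6, 4], cov, 20000)   # shifted right only
green = rng.multivariate_normal([4, 4], cov, 20000)

# At equal education (x = 5), blues earn less: the straightforward
# 'discrimination against blues' picture, with no reverse regression.
blue_pay = blue[np.abs(blue[:, 0] - 5) < 0.2, 1].mean()
green_pay = green[np.abs(green[:, 0] - 5) < 0.2, 1].mean()
```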
I would use the language of B to mean “in the three-node model which has color, education, and income, the direct effect of education on income is positive,” which does not appear to be true in the graphs, which look like they could be generated from an E<-C->I model. While it could also be used to mean “in the two-node model which has education and income, the direct effect of education on income is positive,” that seems unnatural in a case where you know that the link from E to I is spurious (that is, it flows backwards along the causal path from C to E; changing your level of education can’t change your color). But this could just be me expecting an unreasonable level of precision from your wording, since the straightforward interpretation, though unnatural, does fit the data (although I think it reduces the strength of the “this doesn’t show discrimination” claim, because it does show that what looked like a virtuous education-income link is now a questionable color-income link).
It’s very possible I’ve imagined the difference / misunderstood what you’ve written. My appreciation of the filtering effect of the slices is also very recent, and I may think it’s more important as I think about it more.
It seems that I’m quick to jump to a graphical model with nodes that captures the effects between these groups, and want to keep direct, indirect, and total effects separate. I think that’s why I see the fundamental broad mechanism here as a reversal effect: if you learn about a node C that was previously hidden, the direct path from node A to node B can reverse sign if the indirect path from A to C and C to B takes over the strength of the old connection from A to B. (This has requirements on C, which I think matches up with your statement about inhomogeneous data.)
In this view, noise is just a very convenient way to add new nodes which can cause these reversals, especially when the noise is multidimensional. So when I look at Simpson’s paradox and RRE, the underlying mechanism I see is that there’s a reversal effect going on in each, and so they look fundamentally similar to me. I got the impression from your post that you think there’s a fundamental difference between them, and I don’t see that difference, but I might have misread you, or overestimated what you think the size of the difference is.
I’m glad it was helpful. =)
Technically speaking, if both your variables (x AND y) have errors in them, ordinary least-squares regression is the wrong methodology to use. See http://en.wikipedia.org/wiki/Errors-in-variables_models
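One standard errors-in-variables fix is total least squares (orthogonal regression), here computed from the principal axis of the centered data. The simulation and the equal-error-variance assumption are mine:

```python
import numpy as np

rng = np.random.default_rng(3)
t = rng.uniform(0, 10, 500)          # latent 'true' variable
x = t + rng.normal(0, 1, 500)        # observed x has errors
y = 2 * t + rng.normal(0, 1, 500)    # observed y has errors; true slope 2

# OLS of y on x is attenuated toward zero by the noise in x.
ols_slope = np.polyfit(x, y, 1)[0]

# Total least squares: slope of the first principal axis, which minimizes
# perpendicular (not vertical) distances. Consistent when the error
# variances in x and y are equal, as they are here.
X = np.column_stack([x - x.mean(), y - y.mean()])
_, _, vt = np.linalg.svd(X, full_matrices=False)
tls_slope = vt[0, 1] / vt[0, 0]
```

The orthogonal fit splits the blame between the two variables rather than dumping all the error into y, which is why it avoids the two-different-lines asymmetry discussed above.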