Why don’t you take his model, as described, and see if it really produces this effect?
The first disagreement is about his commentary on his second figure. I don’t have the data to actually calculate the correlation there, but eyeballing it, the subgroups don’t look like they have a positive relationship between education and income anywhere near as strong as the combined group’s.
The second disagreement is on interpretation. If you add noise in both dimensions to a multivariate Gaussian model with mean differences between groups, that mean difference shows up in every slice of the model (scaled by the cosine of the angle between the mean-difference vector and the slice direction). If one subgroup is above and to the right of the other subgroup, that means it’s above in every vertical slice and to the right in every horizontal slice. (On northwest-southeast slices there’s no mean difference between the distributions, just population size differences, and the mean difference is maximized on the northeast-southwest slice.)
The particular slicing used in this effect (looking at each vertical slice individually, and each horizontal slice individually) seems reasonable, except that in the presence of mean differences it behaves as a filter that preserves the NW-SE noise!
The grandparent was wrong before I edited it, where I speculated that the noise had to be negatively correlated. That’s the claim that the major axis of the covariance ellipse has to be oriented in a particular direction, but that was an overreach: you see the reverse regression effect if there is any noise along the NW-SE axis. Take a look at Yan’s first figure: it has noise in both the blues and the greens, but it’s one-dimensional noise running NE-SW, so we don’t see reverse regression.
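To make the filter claim precise under some simplifying assumptions of mine (equal variances in both dimensions, identical covariance for both groups, blue mean shifted from the green mean by $\delta(1,1)$ along the NE diagonal): write the shared covariance as $\begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{pmatrix}$. Then in every vertical slice

$$\mathbb{E}[I \mid E=e, \text{blue}] - \mathbb{E}[I \mid E=e, \text{green}] = \delta - \rho\delta = \delta(1-\rho),$$

and by symmetry the same $\delta(1-\rho)$ is the education gap in every horizontal slice. The gap vanishes exactly when $\rho = 1$, i.e. when the noise collapses onto the NE-SW line as in Yan’s first figure, and it is positive whenever there is any variance along the NW-SE axis.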
My original thought (when I thought you needed the major axis to be NW-SE, rather than just the NW-SE component of the noise to be nonzero) was that this was just a reversal effect, with the noise providing the reversing factor. That’s still true, but I’m surprised at how mild the restrictions on the noise are.
That is, I disagree with Yan that this has a different origin than Simpson’s Paradox, but I agree with Yan that this is an important example of how pernicious reversal effects are, and that noise generates them by default, in some sense. I would demonstrate it with a multivariate Gaussian where the blue mean is [6 6], the green mean is [4 4], and the covariance matrix is [1 .5; .5 1], so that it’s obvious that the dominant relationship for each group is a positive relationship between education and income but the NW-SE relationship exists and these slices make it visible.
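As a concrete check, here is a minimal simulation of that demonstration (the means and covariance are the ones just described; the seed, sample size, and slice width are my own arbitrary choices):

```python
import numpy as np

# Sketch of the demonstration above: means [6, 6] and [4, 4], shared covariance [[1, .5], [.5, 1]].
rng = np.random.default_rng(0)
cov = [[1.0, 0.5], [0.5, 1.0]]
blue = rng.multivariate_normal([6.0, 6.0], cov, size=100_000)   # columns: education, income
green = rng.multivariate_normal([4.0, 4.0], cov, size=100_000)

def slice_mean(data, axis, lo, hi):
    """Mean of the other coordinate over points whose `axis` coordinate lies in [lo, hi)."""
    mask = (data[:, axis] >= lo) & (data[:, axis] < hi)
    return data[mask, 1 - axis].mean()

# Within each group the education-income relationship is positive, as the covariance dictates.
print(np.corrcoef(blue.T)[0, 1], np.corrcoef(green.T)[0, 1])

# Vertical slice (education near 5): mean income of blues vs. greens.
print(slice_mean(blue, 0, 4.5, 5.5), slice_mean(green, 0, 4.5, 5.5))

# Horizontal slice (income near 5): mean education of blues vs. greens.
print(slice_mean(blue, 1, 4.5, 5.5), slice_mean(green, 1, 4.5, 5.5))
```

The first line should show both within-group correlations near 0.5, and both slice comparisons should come out blue above green, which is exactly the pair of facts the reverse-regression reading leans on.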
Hi Vaniver! =D
On the commentary: your eyeballing seems good, but I don’t think I ever said anything about relative comparisons between correlation coefficients (I only claimed that the overall correlation is positive). As you observed, I could easily make all three correlations (blue-only, green-only, or blue+green) positive. I don’t have anything interesting to say about their relative magnitudes.
I don’t quite see the difference in interpretation from what you’ve written. I agree with basically all of it? The fact that the slicing “behaves as a filter”, if I interpret it correctly, is exactly the problem here.
I don’t know what “have a different origin than Simpson’s paradox” means exactly, but here are a few ways they differ and why I say they are “different”:
a fundamental assumption in Simpson’s paradox is that there’s some imbalance in the denominators; in your 2x2x2 table you can’t scale the numbers arbitrarily, and almost every example you can construct (say, the familiar batting-averages example) relies on the fact that the denominators (the row sums, i.e. at-bats) are different. A numeric sketch of this follows below.
the direct cause of the reversal effect here is, as you said, the noise; I don’t think Simpson’s paradox has anything to do with noise.
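Here is the numeric sketch promised above (the numbers are invented for illustration, not real batting data):

```python
# Hypothetical batting averages: A beats B within each split, but B's at-bats are
# concentrated in the easier split, so B wins overall. The totals are equal on purpose,
# to show that it's the within-split imbalance in denominators doing the work.
splits = {
    "vs. tough pitching": {"A": (30, 100), "B": (2, 10)},
    "vs. easy pitching":  {"A": (5, 10),   "B": (45, 100)},
}

totals = {"A": [0, 0], "B": [0, 0]}
for split, players in splits.items():
    for player, (hits, at_bats) in players.items():
        totals[player][0] += hits
        totals[player][1] += at_bats
        print(f"{split}, {player}: {hits}/{at_bats} = {hits / at_bats:.3f}")

for player, (hits, at_bats) in totals.items():
    print(f"overall, {player}: {hits}/{at_bats} = {hits / at_bats:.3f}")
```

A leads .300 to .200 and .500 to .450 within the splits but trails .318 to .427 overall; make the four denominators equal and the reversal is impossible.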
Idea: my steel-man version of your argument is that reversal effects arise when you have inhomogeneous data, and that is definitely the more general problem common to both situations. In that case I agree. (This is how I teach it at SPARC, at least.)
I don’t think I ever said anything about relative comparisons between correlation coefficients (I only claimed that the overall correlation is positive).
The main line I’m thinking of is:
the data is telling a very simple story, which is that A) blue men are more educated and B) more educated people get paid more.
I don’t think this story quite captures the data, because I can construct a model where both of these are true but we don’t get this effect. If each group has the same link between income and education conditioned on knowing group membership (and a net positive relationship without knowing group membership), but the blue group mean is only to the right of (i.e. more educated than) the green group mean, then you don’t get this effect: along equal-education lines blues don’t earn more than greens (they earn less; this is a straightforward ‘discrimination against blues’ story).
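To put numbers on that counter-model (my parameter choices, reusing the covariance matrix from my earlier demonstration): keep $\Sigma = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}$ for both groups, but put the blue mean at $(6, 4)$ and the green mean at $(4, 4)$. Then

$$\mathbb{E}[I \mid E=e, \text{blue}] - \mathbb{E}[I \mid E=e, \text{green}] = 0.5(e-6) - 0.5(e-4) = -1$$

at every education level, so blues earn less than greens at equal education, while the pooled education-income covariance stays at the positive within-group value of $0.5$ (the horizontal offset between the group means contributes nothing to it).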
I would use the language of B to mean “in the three-node model with color, education, and income, the direct effect of education on income is positive,” which does not appear to be true in the graphs: they look like they could be generated from an E<-C->I model. It could also be used to mean “in the two-node model with education and income, the direct effect of education on income is positive,” but that seems unnatural in a case where you know that the link from E to I is spurious (that is, it flows backwards along the causal path from C to E; changing your level of education can’t change your color). This could just be me expecting an unreasonable level of precision from your wording, since the straightforward interpretation, though unnatural, does fit the data (although I think it weakens the “this doesn’t show discrimination” claim, because it shows that what looked like a virtuous education-income link is now a questionable color-income link).
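For concreteness, a toy linear version of that E<-C->I reading (my own coefficients, not anything from the post): if $E = C + \varepsilon_E$ and $I = C + \varepsilon_I$ with independent noise terms and no arrow from $E$ to $I$, then $\mathrm{Cov}(E, I) = \mathrm{Var}(C) > 0$. The two-node reading of B comes out true even though the direct effect of education on income is exactly zero; all of the apparent link flows through color.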
I don’t quite see the difference in interpretation from what you’ve written.
It’s very possible I’ve imagined the difference / misunderstood what you’ve written. My appreciation of the filtering effect of the slices is also very recent, and I may come to weight it more heavily as I think about it more.
It seems that I’m quick to jump to a graphical model whose nodes capture the effects between these groups, and to want to keep direct, indirect, and total effects separate. I think that’s why I see the fundamental broad mechanism here as a reversal effect: if you learn about a node C that was previously hidden, the estimated direct effect of node A on node B can reverse sign once the indirect path from A through C to B absorbs the strength of the old A-B connection. (This puts requirements on C, which I think matches up with your statement about inhomogeneous data.)
In this view, noise is just a very convenient way to add new nodes that can cause these reversals, especially when the noise is multidimensional. So when I look at Simpson’s paradox and the reverse regression effect (RRE), the underlying mechanism I see in each is a reversal effect, and so they look fundamentally similar to me. I got the impression from your post that you think there’s a fundamental difference between them, and I don’t see that difference, but I might have misread you, or overestimated how large you think the difference is.
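A minimal linear sketch of that reversal mechanism (the variables and coefficients are an invented example of mine, not anything from the post):

```python
import numpy as np

# Invented linear example of a reversal: the direct A -> B effect is negative,
# but a hidden node C drives both A and B, so the marginal A-B slope is positive.
rng = np.random.default_rng(1)
n = 100_000
C = rng.normal(size=n)                               # the initially hidden node
A = 1.0 * C + 0.5 * rng.normal(size=n)               # A listens to C, plus noise
B = -0.5 * A + 2.0 * C + 0.5 * rng.normal(size=n)    # direct A -> B effect is -0.5

# Regress B on A alone (C still hidden): the slope should come out positive (about 1.1).
print(np.polyfit(A, B, 1)[0])

# Regress B on A and C together: the A coefficient should recover the negative direct effect.
X = np.column_stack([A, C, np.ones(n)])
print(np.linalg.lstsq(X, B, rcond=None)[0][:2])
```

Learning about C flips the sign of the estimated A-to-B link, which is the shape I have in mind for both Simpson’s paradox and the RRE.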