Hi Vaniver! =D

On the commentary: your eyeballing seems good, but I don't think I ever said anything about relative comparisons between correlation coefficients, only that the overall correlation is positive. As you observed, I could easily make all three correlations (blue-only, green-only, or blue+green) positive. I don't have anything interesting to say about their relative degrees.
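For instance, here's a minimal sketch (invented numbers, not the post's actual data) of one way to get all three correlations positive at once:

```python
# A quick sketch (invented numbers) where the blue-only, green-only, and
# combined education-income correlations are all positive.
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

def group(edu_mean, inc_base):
    edu = rng.normal(edu_mean, 2, n)
    inc = inc_base + 1.5 * edu + rng.normal(0, 3, n)  # positive link within the group
    return edu, inc

edu_g, inc_g = group(12, 5)    # green
edu_b, inc_b = group(15, 10)   # blue: more educated, higher baseline pay

def corr(x, y):
    return round(np.corrcoef(x, y)[0, 1], 2)

print("green only:", corr(edu_g, inc_g))
print("blue only :", corr(edu_b, inc_b))
print("combined  :", corr(np.concatenate([edu_g, edu_b]),
                          np.concatenate([inc_g, inc_b])))
```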
I don't quite see the difference in interpretation from what you've written. I agree with basically everything you've said? The fact that the slicing "behaves as a filter," if I'm interpreting it correctly, is exactly the problem here.
I don't know exactly what "have a different origin than Simpson's paradox" means, but here are a few ways they differ, and why I say they are "different":
- A fundamental assumption in Simpson's paradox is that there's some imbalance in the denominators; in your 2x2x2 matrix you can't scale the numbers arbitrarily. Almost every example you can construct relies on the fact that the denominators (the row sums, in the familiar batting-averages example) are different (see the sketch after this list).
- The direct cause of the reversal effect here is, as you said, the noise; I don't think Simpson's paradox involves noise at all.
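To make the denominator point concrete, here is a minimal sketch of the batting-averages version, with invented numbers: each per-half comparison favors Player A, but the overall comparison flips, and only because the at-bat counts are imbalanced.

```python
# A minimal, made-up batting-averages example of Simpson's paradox.
# Player A beats Player B in each half of the season, but loses overall;
# the reversal only works because the denominators (at-bats) are
# imbalanced: A gets most at-bats in the half where everyone hits poorly.

players = {
    # (hits, at_bats) for the first and second half; numbers are invented
    "A": [(4, 10), (25, 100)],   # .400, then .250
    "B": [(35, 100), (2, 10)],   # .350, then .200
}

for name, halves in players.items():
    per_half = [h / ab for h, ab in halves]
    total_hits = sum(h for h, _ in halves)
    total_ab = sum(ab for _, ab in halves)
    print(name, [f"{avg:.3f}" for avg in per_half],
          f"overall {total_hits / total_ab:.3f}")

# Output:
# A ['0.400', '0.250'] overall 0.264   <- better in both halves...
# B ['0.350', '0.200'] overall 0.336   <- ...worse in both, better overall
```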
Idea: my steel-man version of your argument is that reversal effects arise when you have inhomogeneous data, and that this is the more general problem common to both situations. In that case I agree. (This is how I teach this class at SPARC, at least.)
> I don't think I ever said anything about relative comparisons between correlation coefficients, only that the overall correlation is positive.
The main line I’m thinking of is:
> the data is telling a very simple story, which is that A) blue men are more educated and B) more educated people get paid more.
I don't think this story quite captures the data, because I can construct a model where both of these are true but we don't get this effect. Suppose the link between income and education is the same within each group once group membership is known (and there's a net positive relationship when it isn't), but the blue group's mean is shifted only to the right (i.e. more educated) relative to the green group's mean. Then you don't get this effect, because at equal education blues don't earn more than greens; they earn less, which is a straightforward 'discrimination against blues' story.
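Here's a minimal simulation of that construction (all numbers invented): both groups share the same education-to-income slope, the blue distribution is shifted only along the education axis, and yet at any fixed education level blues earn less than greens while the pooled education-income relationship stays positive.

```python
# Both A (blues more educated) and B (education predicts income) hold,
# yet at a fixed education level blues earn less than greens, because the
# blue distribution is shifted only along the education axis.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
b = 2.0  # same education -> income slope in both groups

edu_green = rng.normal(12, 2, n)
edu_blue = rng.normal(15, 2, n)           # A: blues more educated on average

base = 20.0
inc_green = base + b * (edu_green - 12) + rng.normal(0, 1, n)
inc_blue = base + b * (edu_blue - 15) + rng.normal(0, 1, n)   # same slope, same income mean

edu = np.concatenate([edu_green, edu_blue])
inc = np.concatenate([inc_green, inc_blue])

print("pooled corr(edu, income):", round(np.corrcoef(edu, inc)[0, 1], 2))  # positive (B)

# Compare incomes at (roughly) equal education, e.g. edu near 14
mask_g = np.abs(edu_green - 14) < 0.5
mask_b = np.abs(edu_blue - 14) < 0.5
print("green income at edu~14:", round(inc_green[mask_g].mean(), 1))  # about 24
print("blue  income at edu~14:", round(inc_blue[mask_b].mean(), 1))   # about 18, lower
```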
I would use the language of B to mean "in the three-node model with color, education, and income, the direct effect of education on income is positive," which does not appear to be true in the graphs; they look like they could be generated from an E<-C->I model. It could also be used to mean "in the two-node model with education and income, the direct effect of education on income is positive," but that seems unnatural in a case where you know that the link from E to I is spurious (that is, it flows backwards along the causal path from C to E; changing your level of education can't change your color). But this could just be me expecting an unreasonable level of precision from your wording, since the straightforward interpretation, though unnatural, does fit the data (although I think it reduces the strength of the "this doesn't show discrimination" claim, because it does show that what looked like a virtuous education-income link is now a questionable color-income link).
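For concreteness, here's a toy version of that E<-C->I story (invented coefficients, not the post's data): color drives both education and income, there is no direct education-to-income edge, and the pooled correlation is still clearly positive even though the within-color correlations vanish.

```python
# A toy fork model E <- C -> I with no direct education -> income edge:
# color alone drives both variables. Pooled, education and income look
# positively linked; within either color the link disappears.
# Purely illustrative numbers.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
color = rng.integers(0, 2, n)                  # 0 = green, 1 = blue
edu = 12 + 3 * color + rng.normal(0, 1, n)     # C -> E
inc = 20 + 5 * color + rng.normal(0, 1, n)     # C -> I  (no E -> I term)

def corr(x, y):
    return round(np.corrcoef(x, y)[0, 1], 2)

print("pooled:", corr(edu, inc))                               # clearly positive
print("green :", corr(edu[color == 0], inc[color == 0]))       # about 0
print("blue  :", corr(edu[color == 1], inc[color == 1]))       # about 0
```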
> I don't quite see the difference in interpretation from what you've written.
It's very possible I've imagined the difference or misunderstood what you've written. My appreciation of the filtering effect of the slices is also very recent, and I may come to think it's more important as I consider it further.
It seems that I’m quick to jump to a graphical model with nodes that captures the effects between these groups, and want to keep direct, indirect, and total effects separate. I think that’s why I see the fundamental broad mechanism here as a reversal effect: if you learn about a node C that was previously hidden, the direct path from node A to node B can reverse sign if the indirect path from A to C and C to B takes over the strength of the old connection from A to B. (This has requirements on C, which I think matches up with your statement about inhomogenous data.)
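A small linear example of that reversal (invented coefficients, with C as a common cause of A and B): the regression of B on A alone has a positive slope, but once the previously hidden C is included, A's coefficient comes out negative, because the positive marginal link was carried by the path through C.

```python
# Marginally, A predicts B positively; conditioning on C reveals a
# negative direct A -> B coefficient. Invented coefficients, purely
# illustrative.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
C = rng.normal(0, 1, n)
A = C + rng.normal(0, 1, n)
B = 2.0 * C - 0.5 * A + rng.normal(0, 1, n)   # direct A -> B effect is -0.5

# Regression of B on A alone (C hidden): slope is positive
slope_marginal = np.polyfit(A, B, 1)[0]

# Regression of B on A and C together: A's coefficient recovers -0.5
X = np.column_stack([A, C, np.ones(n)])
coef_A, coef_C, _ = np.linalg.lstsq(X, B, rcond=None)[0]

print(f"B ~ A alone: slope {slope_marginal:+.2f}")                    # about +0.50
print(f"B ~ A + C:   A coef {coef_A:+.2f}, C coef {coef_C:+.2f}")     # about -0.50, +2.00
```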
In this view, noise is just a very convenient way to add new nodes which can cause these reversals, especially when the noise is multidimensional. So when I look at Simpson's paradox and RRE, the underlying mechanism I see in each is a reversal effect, and so they look fundamentally similar to me. I got the impression from your post that you think there's a fundamental difference between them, and I don't see that difference, but I might have misread you, or overestimated what you think the size of the difference is.