A discussion I had in the Reddit comments on that Slate post made me invent this fake argument:
A: People who drink water inevitably end up dead. Therefore drinking water causes death.
B: No, that is correlation, not causation.
C: No, it is not correlation. To calculate correlation you divide the covariance of the two variables by the product of their standard deviations. In this case there is no variance in either variable, so both standard deviations are zero, so you’re dividing by zero, so correlation is not even defined.
I think it’s an improvement to go from saying “there is obviously something wrong with A’s argument” to actually being able to point out the divide-by-zero in the equation.
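For anyone who wants to see the divide-by-zero happen, here is a minimal sketch in plain Python. Pearson's coefficient is the covariance divided by the product of the two standard deviations. The function name `pearson` and the 0/1 encoding of "drinks water" and "eventually dies" are my own assumptions for illustration, not anything from the Slate post.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length samples.

    Returns None when either sample has zero variance, because the
    denominator is then zero and the coefficient is undefined.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    denom = math.sqrt(var_x * var_y)  # product of the standard deviations
    if denom == 0:
        return None  # undefined: nothing varies, so there is nothing to compare
    return cov / denom

# Hypothetical encoding: 1 = drinks water, 1 = eventually dies.
drinks_water = [1, 1, 1, 1, 1]
eventually_dies = [1, 1, 1, 1, 1]
print(pearson(drinks_water, eventually_dies))  # None
```

Returning `None` rather than 0 is a deliberate choice in this sketch: a correlation of 0 would be a claim about the data, whereas here there is no answer at all.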
If you don’t drink water, you still die—that sounds pretty uncorrelated to me.
Only to score points at the expense of the audience’s vocabulary would one say “there is no variance in either variable” as opposed to saying “there are no people who avoid drinking water, nor people who don’t end up dead, to compare to”.
Let’s not encourage this.
This is perfectly good common-sense reasoning but doesn’t make the point that the correlation is undefined. I would’ve thought the audience for this, i.e. people involved in a correlation versus causation debate, would benefit from seeing that explicitly, if they don’t already. Maybe we are judging the audience differently. If we assume that everyone knows dividing by zero is bad, but don’t use any other technical term (including variance), maybe we can get the point across.
This is perfectly good common-sense reasoning that explains why the correlation is undefined. If your audience has any notion whatsoever of what correlation means, they will understand this. If not, trying to phrase the same argument in terms of math will not help; it will just make it impossible for your audience to engage with your argument.
If the audience is mathematically sophisticated, then writing out the formula for Pearson’s correlation coefficient is just going to distract them from the real issue, which is that the saying should refer to statistical dependence, rather than correlation. In other words, C’s argument only addresses the literal meaning of B’s words, not the substance behind them.
I acknowledge that using the wrong terminology to the wrong audience will make their eyes glaze over and be counter-productive.
I disagree about that. Until I actually took a course in statistics, I wouldn’t have been sure whether the correlation was undefined or just misleading in that case. Again, I agree that not everyone needs this level of precision.
An important issue, but a completely different one. If B said “that is statistical dependence, not causation”, wouldn’t they be equally wrong in exactly the same way?
B would be wrong in the exact same way. So the true reason that B is wrong needs to apply in both cases. On the other hand, appealing to the correlation formula only defeats the correlation version of the argument.
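A rough sketch of the distinction, in case it helps. It checks statistical dependence directly by comparing the empirical joint distribution with the product of the marginals. The function name `empirically_independent` and both toy samples are assumptions made up for illustration.

```python
from collections import Counter
from itertools import product

def empirically_independent(pairs, tol=1e-12):
    """True if the empirical joint distribution of (x, y) pairs equals
    the product of the empirical marginals for every (x, y) combination."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return all(
        abs(joint[(x, y)] / n - (px[x] / n) * (py[y] / n)) <= tol
        for x, y in product(px, py)
    )

# y = x**2 on a symmetric sample: the Pearson correlation is 0,
# yet y is completely determined by x, so the two are dependent.
parabola = [(-1, 1), (0, 0), (1, 1)]

# Hypothetical encoding of A's argument: everyone drinks water (1)
# and everyone eventually dies (1).
water_and_death = [(1, 1)] * 5

print(empirically_independent(parabola))         # False
print(empirically_independent(water_and_death))  # True
```

The first sample shows why dependence is the broader notion: it catches a relationship that the correlation coefficient misses. The second shows that swapping "correlation" for "statistical dependence" does not rescue B, since two constant variables are trivially independent.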
Ah, I see what you mean. You’re right.
Disagree. Our target audience—humans—rarely if ever thinks of ‘correlation’ in terms of its mathematical definition and I suspect would be put off by an attempt to do so.
This is entirely true—as a mere human, my interest plummeted at “covariance”, and I’d still like to think I’m SOMEWHAT equipped to handle correlation/causation. Just not numerically. So, as a roughly average human, I say your suspicions are correct.
The point still applies. What do you mean by “correlation”—formally or informally—when one (or both) of the variables is constant across the population?
The specific fake argument used is flawed because of that. When people make the correlation-causation error, how often are they doing it based on a variable that’s constant across the population? Do people ever really develop ‘drinking water causes x’ beliefs?
It’s a valid point and very true, but I suspect that it isn’t applicable to the issue at hand.
Correlate water consumption rate with lifespan, to get a correlation. My guess is it will be negative.
Why? (EDIT: I guess people in warmer countries tend to drink more water but to have worse health; is that what you’re thinking about?)
Because babies drink less than adults. The lifetime average water consumption of people who die as infants is tiny compared to the lifetime average of adults.
In other words, death prevents drinking water.
Oops—sign mix-up in my mind when I wrote this. I meant the opposite—that I guessed that water consumption rate is negatively correlated with mortality rate.