Sometime ago I set out to create the simplest possible explanation of Simpson’s paradox, without any numbers at all. This was the result:
1) Imagine that most women who get some disease survive, while most men die.
2) Imagine that most women with the disease take a certain medicine, while most men don’t.
3) Imagine that the medicine has absolutely no effect. Women happen to buy it more because it’s marketed to women, and happen to die less for some unrelated physiological reason.
Now if you look at the population as a whole, you’ll see a strong correlation between taking the medicine and surviving. And even if the medicine has a weak negative effect, that won’t sway the correlation much.
Congratulations, now you have an general understanding of why slicing or merging data can introduce or remove meaningless correlations. You’ll never be able to read the press the same way again.
Sometime ago I set out to create the simplest possible explanation of Simpson’s paradox, without any numbers at all. This was the result:
1) Imagine that most women who get some disease survive, while most men die.
2) Imagine that most women with the disease take a certain medicine, while most men don’t.
3) Imagine that the medicine has absolutely no effect. Women happen to buy it more because it’s marketed to women, and happen to die less for some unrelated physiological reason.
Now if you look at the population as a whole, you’ll see a strong correlation between taking the medicine and surviving. And even if the medicine has a weak negative effect, that won’t sway the correlation much.
Congratulations, now you have an general understanding of why slicing or merging data can introduce or remove meaningless correlations. You’ll never be able to read the press the same way again.