Intro
Simpson’s paradox is a phenomenon in statistics where a choice or treatment performs better in all sub-populations but worse in the overall population. Since first learning of Simpson’s Paradox, I’ve struggled to develop a gut-level understanding that lets me easily detect cases of it and clearly explain its cause. This is even after reading Michael Nielsen’s wonderful post that provides a number of different ways of thinking about it and Judea Pearl’s fascinating analysis of it from a causality perspective. In this post, I describe an example and intuitions that finally led to me (seemingly) grokking Simpson’s paradox.
The Parable of Murderball and Tea Party
Imagine we learn of a hypothetical world in which there are two countries named Gentlantis and Rageopolis. Both countries have populations of 1,000,000 people. Further, in both countries, every single individual plays one of two sports (exclusively): Murderball or Tea Party. Rageopolis’s residents love Murderball, so 99.99% (999,900 / 1,000,000) prefer it to Tea Party. On the other hand, the Gentlantians love Tea Party and don’t care for Murderball’s complex rule system and drawn-out games, so 99% (990,000 / 1,000,000) play Tea Party.
From the name, you might guess that Murderball has a much higher injury rate than Tea Party—every year, 99% of people who play Murderball get injured, compared to only 2% of annual Tea Party players. As a result, we can summarize the sports injury numbers and rates in the respective countries as follows.
Country | Murderball Players / Injuries | Tea Party Players / Injuries | Overall Injuries and Rate |
---|---|---|---|
Rageopolis | 999,900 / 989,901 | 100 / 2 | 989,903 / 99% |
Gentlantis | 10,000 / 9,900 | 990,000 / 19,800 | 29,700 / 3% |
As you can see, Rageopolis has a sports injury epidemic. Around 990,000 Rageopolans get hurt every year playing sports. This makes Rageopolis’s politicians look bad when the UN compares their injury rate to Gentlantis’s approximately 3%.
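As a quick sanity check, here is a minimal Python sketch that recomputes each country’s totals from the player counts and per-sport injury rates stated above. The names `players`, `injury_rate`, and `overall` are just illustrative choices, and injuries are rounded to whole people.

```python
# Player counts per country, taken from the parable above.
players = {
    "Rageopolis": {"Murderball": 999_900, "Tea Party": 100},
    "Gentlantis": {"Murderball": 10_000, "Tea Party": 990_000},
}
# Annual injury rate per sport (identical in both countries at this point).
injury_rate = {"Murderball": 0.99, "Tea Party": 0.02}


def overall(country):
    """Return (total injuries, overall injury rate) for one country."""
    injuries = sum(
        round(count * injury_rate[sport])
        for sport, count in players[country].items()
    )
    population = sum(players[country].values())
    return injuries, injuries / population


for country in players:
    injuries, rate = overall(country)
    print(f"{country}: {injuries:,} injuries, {rate:.1%} overall")
```

The per-sport rates are the same in both countries here, so the entire gap in the overall rate comes from which sport each population plays.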
Technically, this is not yet an instance of Simpson’s paradox since the two games have the same injury rate in both countries. But now, say Rageopolis’s government decides they’re tired of looking bad compared to Gentlantis and enacts a strict law that requires anyone playing either sport to wear a helmet. Lo and behold, the law succeeds beyond anyone’s wildest expectations: the Murderball and Tea Party injury rates halve in Rageopolis, dropping from 99% and 2% to 49.5% and 1% respectively. The UN releases their report for the year and it contains the following numbers.
Country | Murderball Players / Injuries | Tea Party Players / Injuries | Overall Injuries and Rate |
---|---|---|---|
Rageopolis | 999,900 / 494,950 | 100 / 1 | 494,951 / 49.5% |
Gentlantis | 10,000 / 9,900 | 990,000 / 19,800 | 29,700 / 3% |
Observe that even though both games are much safer in Rageopolis than in Gentlantis, Rageopolis still has an order of magnitude more sports injuries. Although it may not feel like it, since it might seem obvious when presented this way, this is an example of Simpson’s paradox. Within each sport, Rageopolis residents get hurt at a much lower rate than Gentlantians, yet Rageopolans get hurt playing sports at a much higher rate overall.
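The paradox can be checked directly. The sketch below assumes the halved Rageopolis rates are 49.5% and 1% (half of the original 99% and 2%) and uses the player shares from the parable; the variable names are illustrative only.

```python
# Per-sport injury rates after the helmet law. Rageopolis's rates are assumed
# to be exactly half of the original 99% and 2%; Gentlantis is unchanged.
rates = {
    "Rageopolis": {"Murderball": 0.495, "Tea Party": 0.01},
    "Gentlantis": {"Murderball": 0.99, "Tea Party": 0.02},
}
# Fraction of each country's population playing each sport.
shares = {
    "Rageopolis": {"Murderball": 0.9999, "Tea Party": 0.0001},
    "Gentlantis": {"Murderball": 0.01, "Tea Party": 0.99},
}


def overall_rate(country):
    """Overall injury rate: per-sport rates weighted by player share."""
    return sum(shares[country][s] * rates[country][s] for s in rates[country])


# Rageopolis is safer within every sport...
safer_in_every_sport = all(
    rates["Rageopolis"][s] < rates["Gentlantis"][s] for s in rates["Rageopolis"]
)
# ...yet has the higher injury rate overall.
worse_overall = overall_rate("Rageopolis") > overall_rate("Gentlantis")

print(f"Safer in every sport: {safer_in_every_sport}")
print(f"Worse overall:        {worse_overall}")
print(f"Rageopolis overall:   {overall_rate('Rageopolis'):.2%}")
print(f"Gentlantis overall:   {overall_rate('Gentlantis'):.2%}")
```

Both flags come out true at once, which is exactly the combination that makes this a case of Simpson’s paradox.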
Hypothesis about intuitiveness
When I initially came up with this example, it felt much more obvious to me than other examples of Simpson’s paradox. I suspect this results from two major features of the example:
- It can be described qualitatively in a way that fits with my (and presumably others’) intuition. More specifically, I claim that if I describe the above hypothetical without any numbers as follows,

  > There are two countries, Rageopolis and Gentlantis. The same number of people play sports in both. But, in Rageopolis the vast majority of people play a super dangerous game, whereas the vast majority of people in Gentlantis play a super safe game. Naturally, Rageopolis has a much higher sports injury rate. If Rageopolis makes both games somewhat but not an order of magnitude safer, which country will still have the higher injury rate?

  most people would still give the right answer (Rageopolis).

- Using big numbers and extreme class imbalances makes the difference between the overall rate and the per-class rates obvious in a way it’s not when the differences are relatively small. It’s hard for me to even describe, but for whatever reason, past a certain threshold of class imbalance (satisfied by the example above), Simpson’s Paradox goes from feeling like a paradox to an obvious fact.
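One way to make the role of class imbalance concrete is to treat the overall rate as a share-weighted average of the per-sport rates and vary the share. The sketch below is only an illustration: it assumes halved Rageopolis rates of 49.5% and 1%, and the list of Murderball shares is made up for the sweep rather than taken from the parable.

```python
def overall_rate(murderball_share, murderball_rate, tea_party_rate):
    """Overall injury rate as a share-weighted average of the per-sport rates."""
    return (
        murderball_share * murderball_rate
        + (1 - murderball_share) * tea_party_rate
    )


# Gentlantis: 1% of people play Murderball, at the original 99% and 2% rates.
gentlantis = overall_rate(0.01, 0.99, 0.02)

# Rageopolis with halved rates, for several hypothetical Murderball shares.
for share in (0.01, 0.04, 0.05, 0.5, 0.9999):
    rageopolis = overall_rate(share, 0.495, 0.01)
    verdict = "higher" if rageopolis > gentlantis else "lower"
    print(
        f"Murderball share {share:.2%}: Rageopolis {rageopolis:.2%} "
        f"({verdict} than Gentlantis at {gentlantis:.2%})"
    )
```

Under these assumptions the comparison flips somewhere between a 4% and a 5% Murderball share; at the parable’s 99.99% share, the per-sport improvement is swamped by the difference in who plays what.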
Unit testing intuition
To verify that the above example actually helps make sense of Simpson’s Paradox in other examples, let’s see if it can help us grok the first example from Nielsen’s post.
The first example Nielsen describes is a situation in which there are two treatments for kidney stones, A and B, and treatment B works better on patients with large and small kidney stones but worse overall. Copying over his table, the hypothetical data is as follows.
Group | Treatment A helps | Treatment B helps |
---|---|---|
Large kidney stones | 69% (55 / 80) | 73% (192 / 263) |
Small kidney stones | 87% (234 / 270) | 93% (81 / 87) |
All patients | 83% (289 / 350) | 78% (273 / 350) |
(I’m embarrassed to admit that, before coming up with the Rageopolis / Gentlantis example, and even after reading Nielsen’s post twice, I still had to spend a good amount of time reminding myself how the above combination was possible.)
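To see how the combination arises, the following sketch recomputes the per-group and pooled help rates from the counts in the table. The `data`, `rate`, and `pooled` names are illustrative, not anything from Nielsen’s post.

```python
# (helped, treated) counts copied from the table above.
data = {
    "A": {"large": (55, 80), "small": (234, 270)},
    "B": {"large": (192, 263), "small": (81, 87)},
}


def rate(helped, treated):
    """Fraction of treated patients who were helped."""
    return helped / treated


def pooled(treatment):
    """Return (helped, treated) summed over both stone sizes."""
    helped = sum(h for h, _ in data[treatment].values())
    treated = sum(t for _, t in data[treatment].values())
    return helped, treated


for group in ("large", "small"):
    a = rate(*data["A"][group])
    b = rate(*data["B"][group])
    print(f"{group} stones: A {a:.0%}, B {b:.0%}")

for treatment in ("A", "B"):
    helped, treated = pooled(treatment)
    print(f"all patients, {treatment}: {helped} / {treated} = {helped / treated:.0%}")
```

Note the group sizes: 270 of treatment A’s 350 patients have small stones, while 263 of treatment B’s 350 have large stones, so B’s pooled rate is dominated by the harder cases.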
The puzzle here is figuring out how one treatment can be better for both groups individually but worse overall and, ideally, which treatment would be better to accept as a patient. As a starting point, let’s analogize this example to the Murderball one. There are two divisions in this example, treatment group and kidney stone size. We want to relate these to the two divisions from our above example, nationality and preferred sport. Rates were measured by treatment group, so it seems like treatment groups are analogous to countries in our example. This means that kidney stone size should be analogous to preferred sport.
Following our above example, the next step that comes to mind is to see whether things become more obvious when we increase the imbalance between the two kidney stone size groups. Looking at the table, large kidney stone people seem to have worse outcomes overall, similar to Murderball players. So, let’s imagine that instead of having 80 vs. 263 large kidney stone people receiving treatments A and B respectively we have 10 and 340. Let’s also imagine that treatment A helps 10% of large kidney stone people and treatment B helps 20% of large kidney stone people. Then, we should also adjust the small kidney stone groups accordingly to contain 340 and 10 people with help rates of 45% and 90% respectively. The following table summarizes the results with these updated numbers.
Group | Treatment A helps | Treatment B helps |
---|---|---|
Large kidney stones | 10% (1 / 10) | 20% (68 / 340) |
Small kidney stones | 45% (153 / 340) | 90% (9 / 10) |
All patients | 44% (154 / 350) | 22% (77 / 350) |
Looking at this table, I’m again left feeling like the paradox has mostly dissolved. From the information we have, it seems like large kidney stone people just have much worse outcomes overall and that the treatment B group’s overall help rate reflects that. Relatedly, it seems like treatment B works better than treatment A, although we should probably be somewhat suspicious of these results given that the selection process produced such imbalanced classes in the first place.
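For anyone who wants to check the arithmetic of the exaggerated example, here is a small sketch that derives the pooled rows from the group sizes and help rates chosen above. The `groups` layout and the `pooled` helper are illustrative, and helped counts are rounded to whole patients.

```python
# (group size, help rate) for the exaggerated example described above.
groups = {
    "A": {"large": (10, 0.10), "small": (340, 0.45)},
    "B": {"large": (340, 0.20), "small": (10, 0.90)},
}


def pooled(treatment):
    """Return (patients helped, patients treated) pooled over both stone sizes."""
    helped = sum(round(size * rate) for size, rate in groups[treatment].values())
    treated = sum(size for size, _ in groups[treatment].values())
    return helped, treated


for treatment in ("A", "B"):
    helped, treated = pooled(treatment)
    print(f"Treatment {treatment}: {helped} / {treated} = {helped / treated:.0%}")
```

Treatment B has twice treatment A’s help rate in each group, yet its pooled rate is half of A’s, because almost all of B’s patients are in the large-stone group.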
Conclusion
Part of the reason learning technical material is hard is that different ways of explaining things work for different people (note that I’m not claiming anything like “learning styles” or “multiple intelligences”, since my impression is that both of those hypotheses have mostly failed to replicate). Beyond clarifying my own intuition, I hope that this post can help at least one person who is as confused about Simpson’s Paradox as I was become less confused.
Here’s the simplest explanation of Simpson’s paradox that I know: take any two variables that are positively correlated, for example height and weight. Now consider all people whose height in cm + weight in kg equals 250. In that group, height and weight are negatively correlated. But all people can be divided into such groups :-)
Nice, I’d neither heard nor thought of this framing before. Thanks!