Causal inference for the home gardener
Note: This is meant to be an accessible introduction to causal inference. Comments appreciated.
Let’s say you buy a basil plant and put it on the counter in your kitchen. Unfortunately, it dies in a week.
So the next week you buy another basil plant and feed it a special powder, Vitality Plus. This second plant lives. Does that mean Vitality Plus worked?
Not necessarily! Maybe the second week was a lot sunnier, you were better about watering, or you didn’t grab a few leaves for a pasta. In other words, it wasn’t a controlled experiment. If some other variable like sun, water, or pasta is driving the results you’re seeing, your study is confounded, and you’ve fallen prey to a core issue in science.
When someone says “correlation is not causation,” they’re usually talking about confounding. Here are some examples:
A 2019 study found that student test scores correlate with the number of books in the home. But this isn’t a reason to start sending (lots of really thin!) books to everyone. It could simply be that books are a proxy for parental intelligence, or that parents with lots of books also hire tutors. Both associations would mean that the kids in book-heavy houses are smarter—without the books playing a causal role. Educated parents do a lot of things (e.g., piano lessons) that get imitated but have dubious benefits.
Prison inmates tend to live shorter lives. Advocates and researchers have long blamed prison for this, citing unsafe and unsanitary conditions. But a recent careful study found that convicted offenders sent to prison live longer, not shorter, than defendants who dodged a prison sentence. Deaths from car crashes, drug overdoses, and even heart attacks were lower in prison. The confound in the earlier, poorly controlled studies was essentially lifestyle. People living on the margins of incarceration are already at much higher risk of death—even if they never get caught and sentenced. So while there’s a correlation between prison time and an early death, prison isn’t the cause. If anything, prison extends lifespan for people unfortunate enough to be standing trial for serious offenses.
Cardiologists used to think that Vitamin E reduced heart attack risk because of several studies tracking health outcomes and diet. But years later, a more reliable trial (where subjects were randomized to receive Vitamin E supplements or a placebo) demonstrated that, if anything, Vitamin E actually increases heart attack risk. Why the flip? There must have been some lingering confounding in the initial studies, wherein generally healthy people also happened to get more dietary vitamin E. This made vitamin E look beneficial, when actually it was having a neutral to negative impact. As with a lot of studies on diet, it would have been prudent to reserve judgment until randomized experiments could confirm the hunch.
So now you know that you shouldn’t compare plants that you bought at different times, because this risks confounding. One way to address confounding is to try to hold all the important variables constant—a controlled experiment. You buy two plants at the same time from the same store. You put them in the same spot and water them equally, and always pluck the same number of leaves from each. The treated plant survives, and the control plant withers.
Does the powder work? A remaining problem is that even holding constant many of the variables (store, date bought, and so on), there’s still some inherent randomness in the life of a basil plant.
This randomness could be due to genetics or the soil conditions when it was a wee sprout. With enough plants, it would wash out, with either group as likely to be lucky as unlucky on average. With just two plants, however, it’s likely that random factors would cloud or even exceed the benefit from the powder. When the measured benefit in your study is plausibly just random noise, your study is underpowered. In engineering, this could be seen as a signal-to-noise problem. With only two plants, the noise (random variation) might overwhelm the signal (the effect of Vitality Plus).
Most of the time, we fix power issues by increasing the sample size. A/B test calculators online allow you to input an expected effect and the level of certainty you want. The sample size you get is the minimum number of people (or plants) you would need to be relatively certain that your experimental manipulation really drove the effect. For huge effects, you need smaller sample sizes.
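If you want to see this play out, here’s a small Python simulation. The survival rates are made up (50% without the powder, 80% with it), and “detecting” the effect uses a crude two-standard-error rule of thumb rather than a formal test—it’s a sketch of the idea, not a power calculator:

```python
import random

def power(n_per_group, p_control=0.5, p_treated=0.8, sims=2000, seed=0):
    """Estimate the chance an experiment of this size detects a real effect.
    Rates are hypothetical. "Detects" = treated survival beats control
    survival by more than two standard errors of the difference."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        c = sum(rng.random() < p_control for _ in range(n_per_group))
        t = sum(rng.random() < p_treated for _ in range(n_per_group))
        pc, pt = c / n_per_group, t / n_per_group
        se = ((pc * (1 - pc) + pt * (1 - pt)) / n_per_group) ** 0.5
        if se > 0 and (pt - pc) > 2 * se:
            hits += 1
    return hits / sims

# With 2 plants per group, even a big real effect is nearly always
# lost in the noise; with 50 per group, you usually detect it.
print(power(2), power(50))
```

Notice that the true effect (a 30-point jump in survival) never changes; only your ability to see it through the noise does.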
Parents often spout “sample of 2” studies like this. “My first kid struggled at reading, but the second one reads at an advanced grade level. The second one did Montessori—it works wonders!” In reality, two kids is probably not enough to learn about school efficacy since there are so many different factors driving educational achievement. It’s a low signal-to-noise ratio.
Super small studies can be helpful when there’s very little random noise in the outcome. Say a physician tries one last-ditch chemo drug on a cancer patient facing a 99.9% chance of death in the following year, based on past patients with the same condition. If the patient lives, we can be pretty confident the drug worked: they were extremely unlikely to have survived by random luck, so it must have been something the doctor was trying. Similarly, it’s not erroneous for a biologist to compare only two Petri dishes: a highly controlled environment can reduce random noise to near zero.
Now you know that you shouldn’t compare plants raised in different conditions (because there could be confounding) and you can’t just compare two plants, even with lots of control over their conditions (because of random variation—one plant could get lucky, independent of Vitality Plus).
We need a large sample of plants with random variation in which one gets treated. What are some of the techniques?
Experiment or Randomized trial: A randomized trial is the gold standard of causal inference. It’s how we figured out that the COVID vaccines worked, and how most drugs are approved. In a randomized trial, you flip a coin to determine whether each plant gets Vitality Plus or the usual treatment. With a large enough sample, you can be confident that most of the difference in outcomes is due to the randomized treatment. Alternatively, you might find that survival rates are identical, suggesting a zero effect.
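Here’s a quick sketch of what that coin flip buys you. The survival rates are invented (the true effect is set to 0.20), but because assignment is random, the estimated difference lands near the truth:

```python
import random

rng = random.Random(42)

# Hypothetical world: Vitality Plus raises a basil plant's survival
# chance from 50% to 70%. These rates are made up for illustration.
def survives(treated):
    return rng.random() < (0.7 if treated else 0.5)

treated, control = [], []
for _ in range(1000):
    if rng.random() < 0.5:            # the coin flip: random assignment
        treated.append(survives(True))
    else:
        control.append(survives(False))

effect = sum(treated) / len(treated) - sum(control) / len(control)
print(f"estimated effect: {effect:.2f}")  # should land near the true 0.20
```

No variable—sun, watering, pasta plucking—can be correlated with the coin flip, so nothing can confound the comparison.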
Even within randomized trials, there are ways to make your estimate more precise. One method is stratifying. In the basil example, this might mean ordering the plants by height and then treating every other plant. This helps with precision because it more tightly equalizes the starting height of the plants across groups, shrinking the portion of random noise that will come from their initial health.
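Here’s a toy version of that every-other-plant scheme. The starting heights are invented; the point is that alternating down the sorted list leaves the two groups nearly matched on initial height:

```python
# Invented starting heights (cm) for eight basil plants.
heights = [12.1, 9.8, 15.0, 10.4, 13.7, 11.2, 14.3, 9.1]

# Sort plants by height, then treat every other one in that order.
by_height = sorted(range(len(heights)), key=lambda i: heights[i])
treated_ids = set(by_height[::2])

treated_h = [heights[i] for i in treated_ids]
control_h = [heights[i] for i in range(len(heights)) if i not in treated_ids]

# The groups start out with very similar average heights.
print(sum(treated_h) / len(treated_h), sum(control_h) / len(control_h))
```

A plain coin flip could, by bad luck, put most of the tall plants in one group; stratifying rules that out by construction.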
Note that the initial coin flip is powerful, even if you mistakenly powdered some control plants and forgot to medicate some treated ones. You run an intent-to-treat analysis, which analyzes all plants based on their assigned category: you look at the average survival of those you intended to powder vs. those you intended to leave alone. The mixups decrease statistical power a bit but do not invalidate your experiment, as long as you base all analysis on the intent, not what actually happened. This is why, for example, it doesn’t ruin a colonoscopy study if we can merely encourage the treatment group to get screened and only 40% do. We still learn a lot about the effect of colonoscopies because the encouragement itself created a lot of random uptake.
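In code, intent-to-treat analysis is just grouping by assignment, never by what actually happened. The survival data here (1 = plant survived) is invented:

```python
from statistics import mean

# Analyze by *assigned* group, even though a few assignments were botched.
assigned_powder  = [1, 1, 0, 1, 1, 0, 1, 1]   # includes plants we forgot to powder
assigned_control = [0, 1, 0, 1, 0, 0, 1, 0]   # includes one powdered by mistake

itt = mean(assigned_powder) - mean(assigned_control)
print(f"intent-to-treat effect: {itt:.3f}")
```

The key discipline is that no plant ever switches lists after the coin flip, no matter what happened to it in practice.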
You could also do a within-subjects design, where all plants eventually get the treatment, and look at the average change in growth in weeks where they are vs. aren’t getting the powder. This also shrinks the random noise portion because you’re taking out any fixed differences across plants and just looking at their change in growth.
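A sketch of that within-subjects comparison, with invented growth rates (cm/week)—each plant is compared only to itself:

```python
from statistics import mean

# Hypothetical growth rates for the same three plants in powder
# weeks vs. plain weeks.
growth_on  = {"plant_a": 2.1, "plant_b": 3.4, "plant_c": 1.8}
growth_off = {"plant_a": 1.6, "plant_b": 2.8, "plant_c": 1.5}

per_plant_change = [growth_on[p] - growth_off[p] for p in growth_on]
print(mean(per_plant_change))  # fixed plant-to-plant differences cancel out
```

A naturally vigorous plant is vigorous in both conditions, so its head start subtracts away in the difference.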
Quasi-experiment: if you can’t control the treatment, the next best thing is to find a situation that leads to essentially random variation in the treatment. For example, maybe Safeway sells Vitality Plus next to its basil plants while Whole Foods doesn’t carry it, so Safeway shoppers use the powder and Whole Foods shoppers don’t. If you think that basil plants cared for by the two groups of shoppers would otherwise have identical trajectories, you can compare plant outcomes for Whole Foods vs. Safeway shoppers to figure out whether Vitality Plus works. Because the grocery store creates random variation in treatment without (we think) affecting anything else, it’s what’s called an instrumental variable.
The difficulty is that whether one of these variables is as-good-as-random is often up for debate, and it’s difficult to prove one way or the other. For example, one plausible concern here is that Whole Foods shoppers are wealthier, with sunnier kitchens. To test this and related concerns, you could look at the past plant survival rates of the two groups of shoppers. If they were identical, you could be more confident that you really are isolating variation in the powder.
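If you do buy the store-as-good-as-random assumption, the simplest version of the instrumental-variable calculation is the Wald ratio. All numbers below are invented:

```python
# Hypothetical data: powder use and basil survival by store.
# We assume store choice shifts powder use but doesn't otherwise
# affect survival.
safeway    = {"powder_rate": 0.60, "survival": 0.55}
wholefoods = {"powder_rate": 0.00, "survival": 0.43}

# Wald ratio: effect of the store on survival, scaled by the
# effect of the store on actually using the powder.
wald = (safeway["survival"] - wholefoods["survival"]) / \
       (safeway["powder_rate"] - wholefoods["powder_rate"])
print(f"estimated powder effect: {wald:.2f}")
```

The scaling in the denominator matters: the raw 12-point survival gap understates the powder’s effect, because only 60% of Safeway shoppers actually used it.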
Some successful instrumental variables include: random assignment to judges in criminal cases (variation in strictness helps to measure the causal effect of incarceration), being born just before vs. after a new maternal leave policy (the lucky babies got way more time with mom), and getting an ER physician who is looser with opioid prescriptions (it makes you more likely to have drug problems in the future).
Whether you’re testing plant powder, educational methods, or medical treatments, the principles remain the same: Watch out for confounding variables. Use large enough samples to overcome random noise. And create or find random variation in treatment take-up for a reliable estimate. These provide some of the best defense against bad ideas that invariably sprout up.
I think I’d have wanted to know about tigramite when learning about causal inference: it’s a Python library for doing causal inference on time-series data.