We have some evidence that masks work

by Gavin Leech and Charlie Rogers-Smith

Our work on masks vs COVID at the population level was recently reproduced, along with a bunch of additional experiments. These experiments seem to cast doubt on our results, but we think each of them is misguided. Since the post got some traction on LW and Marginal Revolution, we decided to respond.

Nevertheless, thanks to Mike, who put a lot of work in, and who was the only person in the world to check our results, despite plenty of people trying to gotcha us on Twitter.

“Observational Window”

Best-guess summary of Mike’s analysis: he extends the window of analysis by a bit and runs our model. He does this because he’s concerned that we chose a window with low transmissibility to make masks look more effective than they are. However, he finds similar results to the original paper, and concludes that our results seem robust to longer periods.

But as our paper notes, a longer window isn’t valid with this data. After September, many countries move to subnational NPIs, and our analysis is national. Our NPI data source doesn’t capture these subnational interventions properly, so the data stop being suitable for national analyses after that point.

Estimates of the national mask effect after this point don’t properly adjust for these crucial factors, and so masks will “steal” statistical power from them (absorbing effects that really belong to the uncaptured NPIs). So this analysis isn’t good evidence about the robustness of our results to a longer window.

“Regional Effects”

MH: “If mask wearing causes a drop in transmissibility, then regions with higher levels of mask wearing should observe lower growth rates.”

Best-guess summary of Mike’s analysis: a correlational analysis between the median wearing level of a region and the R0 (the expected number of new cases per initial case in a region) that our model infers. (What he calls ‘growth rates’, but which are not growth rates.) He claims that if wearing is effective then the correlation should be negative. The intuition is that if masks work, then countries with lots of mask-wearing should have lower transmissibility. Instead, he finds that the correlation is positive.

This is interesting, but the conclusion doesn’t seem right. You can tell a bunch of stories about why mask-wearing might be correlated with R0, independent of mask effectiveness. For example, it seems plausible that when transmissibility increases, more people wear masks. Overall, the correlation between average mask-wearing and constant regional transmissibility is such weak evidence that we honestly don’t know in which direction to update our beliefs based on this result.
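To make that story concrete, here’s a toy simulation (not our model, and not Mike’s code; all numbers are illustrative). Masks genuinely reduce transmission in the simulation, but because wearing responds to local transmissibility, the cross-region correlation between average wearing and R0 still comes out positive:

```python
# Toy simulation: masks work, yet corr(average wearing, R0) is positive,
# because regions with worse underlying spread mask up more.
import numpy as np

rng = np.random.default_rng(0)
n_regions = 90

base_R0 = rng.uniform(1.0, 3.0, n_regions)   # true regional transmissibility
# Wearing responds to transmissibility: worse spread -> more masks.
wearing = np.clip(0.3 * (base_R0 - 1.0) + rng.normal(0, 0.1, n_regions), 0, 1)

mask_effect = 0.25                           # masks DO work: 25% cut at 100% wearing
observed_R = base_R0 * (1 - mask_effect * wearing)

# The regional-effects analysis: correlate average wearing with regional R.
print(np.corrcoef(wearing, observed_R)[0, 1])  # strongly positive
```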

It’s also worth highlighting how much information this analysis discards. He takes a scalar (the regional R0) and then plots it against a scalar (the median wearing level). But this averages away almost all of the information.

[Figures: one day of our data, in one region; Mike’s analysis removes all time info.]

It’s hard to say anything about the relationship between R0 and wearing as a static average. You have to look at changes in mask-wearing.

MH: “Within a given region, increased mask usage is correlated with lower growth rates (the 25% claimed effectiveness), but when comparing across regions masks seem to be ineffective.”

Even given fixes to the above, this doesn’t follow. Our posterior with a median of 25% is a pooled estimate across regions.

Endogeneity (the estimate being biased by, for instance, people masking up in response to an outbreak) is a real concern, but the above doesn’t show this either. We can see how serious endogeneity could be by looking at the correlation between mask level and case level: r = 0.05.

“Uniform Regional Transmissibility”

MH: “The first experiment was to force all regions to share the same base transmissibility. This provided an estimate that masks had an effectiveness of −10%”

Best-guess summary of Mike’s analysis: Mike sets all R0s to the same value and runs the model. He does this to isolate the ‘relative’ effect of mask-wearing—i.e. the effect from day-to-day changes in wearing, as opposed to absolute mask-wearing.

I think the intuition comes from the fact that we use two sources of info to determine mask-wearing effectiveness: the starting level of mask-wearing in a region, and day-to-day changes. It would be cool to see what we infer from only day-to-day changes in wearing. But this method doesn’t achieve this; instead, setting all the R0s to be the same will bias the wearing-effect estimate.

To see this, suppose we have data on two regions. Region A has an R0 of 0.5 and B has an R0 of 1.5, but we don’t know these values. Further assume that region B has more mask-wearing, which is consistent with Mike’s finding that there’s a small positive correlation between R0 and mask-wearing. What happens if we force these values to be the same, say R0 = 1.0? Well, the model will use the mask-wearing effect to shrink A’s 1.0 down to 0.5, and to pull B’s 1.0 up to 1.5. And since B has more mask-wearing relative to A, this is only possible when the mask-wearing effect is negative. So fixing region R0s to be the same creates a strong negative bias on the mask effect estimate.
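Here is the same example as a numeric sketch, under an assumed linear-in-wearing link, R_effective = R0 × (1 − e × wearing). Our model’s actual functional form differs, but the direction of the bias is the same:

```python
# Two regions with hypothetical true values, as in the example above:
# A has R0 = 0.5 and little wearing; B has R0 = 1.5 and lots of wearing.
wearing_A, wearing_B = 0.5, 0.9
true_R_A, true_R_B = 0.5, 1.5

forced_R0 = 1.0  # the experiment pins both regions to the same R0

# With R0 fixed, the only knob left to fit each region's observed R is e:
#   true_R = forced_R0 * (1 - e * wearing)  =>  e = (1 - true_R / forced_R0) / wearing
e_A = (1 - true_R_A / forced_R0) / wearing_A   # +1.0: A wants a large positive effect
e_B = (1 - true_R_B / forced_R0) / wearing_B   # -0.56: B wants a negative effect

print(e_A, e_B)
# A single shared effect must reconcile these, and pulling high-wearing B's R
# up to 1.5 is only possible with e < 0 -- an artifact of the constraint.
```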

Can we do better? We think so! As we mentioned to Mike in correspondence, a better way to isolate the effect from day-to-day changes in wearing would be to zero-out wearing at the start of the period, so that no information from wearing levels can inform our estimate of R0. We tried this analysis and got a 40% reduction in R from mask-wearing, with large uncertainty (because we’re removing an information source).
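A minimal sketch of that adjustment (the function below is illustrative, not our actual code):

```python
# Zero out each region's initial wearing level, so that only day-to-day
# CHANGES in wearing remain to inform the effect estimate.
import numpy as np

def zero_out_initial_wearing(wearing: np.ndarray) -> np.ndarray:
    """wearing: array of shape (n_regions, n_days)."""
    return wearing - wearing[:, :1]  # broadcast-subtract each region's day-0 level

# The starting level now carries no information, so R0 absorbs it and the
# wearing effect is identified from within-region variation alone, at the
# cost of precision (hence our wide posterior).
```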

“No Mask Variation”

MH: “The next experiment was to force each region to use a constant value for mask wearing (the average value in the time period).”

Best-guess summary of Mike’s analysis: let’s isolate the absolute effect now! To do this, set mask-wearing to be constant across the period.

Most of what our model uses to estimate the effect is day-to-day changes in wearing and transmission. If mask-wearing is set constant, our model will still be ‘learning’ about the mask-wearing effect from day-to-day changes in transmissibility, even if mask-wearing doesn’t change—it will just be learning from false data. Setting mask-wearing constant is not [inferring nothing from day-to-day changes in transmission], it’s inferring false things.
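One way to see the problem (a toy check under the same illustrative linear-in-wearing link as before, not our exact model): when wearing is constant within a region, the wearing term can be absorbed into that region’s R0, so within-region data cannot identify the effect at all, and whatever the model reports comes from priors and the confounded cross-region channel discussed above.

```python
# With constant wearing, any effect value e can be offset by rescaling R0,
# leaving the model's predictions (and hence the likelihood) unchanged.
import numpy as np

w = 0.6                                    # wearing held constant all period
other_factors = np.linspace(1.0, 0.7, 30)  # genuine day-to-day transmissibility changes

def effective_R(R0, e):
    return R0 * (1 - e * w) * other_factors

R_helpful = effective_R(R0=2.0, e=0.25)    # masks cut R by 25% at full wearing
R_harmful = effective_R(R0=2.0 * (1 - 0.25 * w) / (1 + 0.10 * w), e=-0.10)

print(np.allclose(R_helpful, R_harmful))   # True: e is unidentified within a region
```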

“Data Extrapolation”

MH: “the failure of large absolute differences in variable X across regions to meaningfully impact the observed growth rate … should make us skeptical of large claimed effects”

Best-guess summary of Mike’s analysis: Let’s compare changes in wearing from April to May to changes in growth rates. If masks work then we should find a strong negative correlation.

The method:

  1. For April and May, take the average wearing level in each region.

  2. WearingChange = May average mask wearing − April average mask wearing (x-axis).

  3. AprilCaseRatio = cases @ end April / cases @ start April.

  4. MayCaseRatio = cases @ end May / cases @ start May.

  5. “Growth rate” = AprilCaseRatio / MayCaseRatio (y-axis).

  6. Scatterplot each region.

(This throws away even more useful information—picking out two days in April and May throws away 96% of the dataset.)
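For concreteness, here’s a sketch of those steps as we understand them (Mike didn’t share code, so the data layout and names below are our guesses):

```python
# Our reconstruction of the steps above (Mike's code wasn't shared, so the
# column names and details here are hypothetical).
import pandas as pd

def mikes_growth_analysis(df: pd.DataFrame) -> pd.DataFrame:
    """df: one row per (region, date), with datetime 'date', 'wearing', 'cases'."""
    rows = []
    for region, g in df.groupby("region"):
        april = g[g["date"].dt.month == 4].sort_values("date")
        may = g[g["date"].dt.month == 5].sort_values("date")
        wearing_change = may["wearing"].mean() - april["wearing"].mean()  # x-axis
        april_ratio = april["cases"].iloc[-1] / april["cases"].iloc[0]
        may_ratio = may["cases"].iloc[-1] / may["cases"].iloc[0]
        rows.append({"region": region,
                     "wearing_change": wearing_change,
                     "growth_rate": april_ratio / may_ratio})             # y-axis
    return pd.DataFrame(rows)

# Note how little survives per region: two monthly means and four case counts.
```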

But this analysis doesn’t account for any of the known factors affecting transmission. The crux here is whether we’d expect to ‘see’ the effect of wearing amidst all the variation in those other factors. Averaging over 90 or so regions could smooth out random, region-specific factors. However, if factors are persistent across regions (factors that, for example, increase transmission across most countries at once), then this method will not uncover the wearing effect. And in fact there are strong international trends in such factors in May 2020. So we wouldn’t update much unless the correlation was particularly strong.

However, let’s assume we could find the wearing effect using this method. Mike implies this slope should be more strongly negative than he observes. How negative should it be, under our model (a 25% reduction in R due to masks)? Let’s look at Mike’s plot and compare it to the mean of our posterior (which you shouldn’t do, but anyway):

[Figures: Mike’s simple linear regression (~20% decrease); the reduction inferred by our model over this range (~12.5% decrease).]

The slope he finds is similar to our model estimate; in fact it’s more negative. (Looks like a ~20% decrease in Growth Difference for +50% wearing.) In our plot, y is the reduction in R, not Mike’s ratio of case ratios, but it gives you an idea of what a point estimate of our claim looks like on a unit scale, on the same figure grid.
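As a rough back-of-envelope check (not the model’s exact functional form), here is what a 25% reduction at 100% wearing implies over a +50% change in wearing:

```python
# Two crude readings of "25% reduction at full wearing" over a +50% change:
linear = 0.25 * 0.5                # linear interpolation: 12.5%
geometric = 1 - (1 - 0.25) ** 0.5  # multiplicative per unit wearing: ~13.4%
print(linear, geometric)
# Either lands near the ~12.5% our model shows over this range, shallower
# than the ~20% slope of Mike's regression.
```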


It’s difficult to see medium-sized effects with methods this simple, and this is why we use a semi-mechanistic model. Correlational analysis of cases (or growth) against wearing neglects important factors we need to adjust for (e.g. mobility, at a 40% effect!). Moreover, doing it in the way described neglects a lot of data.

(Charlie wants to indicate that he isn’t confident in the following paragraph—not because he disagrees, but because he hasn’t been following the broader literature.)

Even if the experiments above showed what they purport to, Mike’s title, ‘We Still Don’t Know If Masks Work’, would still be misleading. By this point we have convergent lines of evidence: meta-analyses of clinical trials, slightly nightmarish animal models, mechanistic models, an ok cloth mask study. We should get the big RCT results soon: I (Gavin) am happy to bet Mike $100 that this will find a median reduction in R greater than 15% (for the 0-100% effect).

Our paper is observational, and there are limits to how strong such evidence can be. Mike says “they are not sufficient to prove a causal story”; this much we agree on.


Lastly: we exchanged 13 emails with Mike, helping him get the model converging and explaining most of these errors (though not as extensively). It was disappointing to find none of them corrected, no mention of our emails, no code, and no note that he had posted his work.