One Study, Many Results (Matt Clancy)


I didn’t see this post, its author, or the study involved elsewhere on LW, so I’m crossposting the content. Let me know if this is redundant, and I’ll take it down.

Summary

This post looks at cases where teams of researchers all began with the same data, then used it to answer a question — and got a bunch of different answers, based on their different approaches to statistical testing, “judgment calls”, etc.

This shows the difficulty of doing good replication work even without publication bias; none of the teams here had any special incentive to come up with a certain result, and they all seemed to be doing their best to really answer the question.

Also, I’ll copy the conclusion of the post and put it here:

More broadly, I take away three things from this literature:

  1. Failures to replicate are to be expected, given the state of our methodological technology, even in the best circumstances, even if there’s no publication bias.

  2. Form your ideas based on suites of papers, or entire literatures, not primarily on individual studies.

  3. There is plenty of randomness in the research process for publication bias to exploit. More on that in the future.

The post

Science is commonly understood as being a lot more certain than it is. In popular science books and articles, an extremely common approach is to pair a deep dive into one study with an illustrative anecdote. The implication is that’s enough: the study discovered something deep, and the anecdote made the discovery accessible. Or take the coverage of science in the popular press (and even the academic press): most coverage of science revolves around highlighting the results of a single new (cool) study. Again, the implication is that one study is enough to know something new. This isn’t universal, and I think coverage has become more cautious and nuanced in some outlets during the era of covid-19, but it’s common enough that for many people “believe science” is a sincere mantra, as if science made pronouncements in the same way religions do.

But that’s not the way it works. Single studies, especially in the social sciences, are not certain. In the 2010s, it became clear that a lot of studies (maybe the majority) do not replicate. The failure of studies to replicate is often blamed (not without evidence) on a bias towards publishing new and exciting results. Consciously or subconsciously, that leads scientists to employ shaky methods that get them the results they want, but which don’t deliver reliable results.

But perhaps it’s worse than that. Suppose you could erase publication bias and just let scientists choose whatever method they thought was the best way to answer a question. You might expect that, freed from the need to find a cool new result, scientists would pick the best method to answer a question and then, well, answer it.

The many-analysts literature shows us that’s not the case though. The truth is, the state of our “methodological technology” just isn’t there yet. There remains a core of unresolvable uncertainty and randomness in the best of circumstances. Science isn’t certain.

Crowdsourcing Science

In many-analyst studies, multiple teams of researchers test the same previously specified hypothesis, using the exact same dataset. In all the cases we’re going to talk about today, publication is not contingent on results, so we don’t have scientists cherry-picking the results that make their work look most interesting; nor do we have replicators cherry-picking results to overturn prior results. Instead, we just have researchers applying judgment to data in the hopes of answering a question. Even so, results can be all over the map.

Let’s start with a really recent paper in economics: Huntington-Klein et al. (2021). In this paper, seven different teams of researchers tackle two research questions that had been previously published in top economics journals (but which were not so well known that the replicators knew about them). In each case, the papers were based on publicly accessible data, and part of the point of the exercise was to see how different decisions about building a dataset from the same public sources lead to different outcomes. In the first case, researchers used variation across US states in compulsory schooling laws to assess the impact of compulsory schooling on teenage pregnancy rates.

Researchers were given a dataset of schooling laws across states and times, but to assess the impact of these laws on teen pregnancy, they had to construct a dataset on individuals from publicly available IPUMS data. In building the data, researchers diverged in how they handled different judgement calls. For example:

One team dropped data on women living in group homes; others kept them.

Some teams counted teenage pregnancy as pregnancy after the age of 14, but one counted pregnancy at the age of 13 as well.

One team dropped data on women who never had any children.

In Ohio, schooling was compulsory until the age of 18 in every year except 1944, when the compulsory schooling age was 8. Was this a genuine policy change? Or a typo? One team dropped this observation, but the others retained it.

Between this and other judgement calls, no team assembled exactly the same dataset. Next, the teams needed to decide how, exactly, to perform the test. Again, each team differed a bit in terms of what variables it chose to control for and which it didn’t. Race? Age? Birth year? Pregnancy year?

It’s not immediately obvious which decisions are the right ones. Unfortunately, they matter a lot! Here were the seven teams’ different results.

Depending on your dataset construction choices and exact specification, you can find that compulsory schooling lowers teenage pregnancy, increases it, or has no impact at all! (There was a second study as well; we will come back to that at the end.)

This isn’t the first paper to take this approach. An early paper in this vein is Silberzahn et al. (2018). In this paper, 29 research teams composed of 61 analysts sought to answer the question “are soccer players with dark skin tone more likely to receive red cards from referees?” This time, teams were given the same data but still had to make decisions about what to include and exclude from analysis. The data consisted of information on all 1,586 soccer players who played in the first male divisions of England, Germany, France and Spain in the 2012-2013 season, and for whom a photograph was available (to code skin tone). There was also data on player interactions with all referees throughout their professional careers, including how many of these interactions ended in a red card and a bunch of additional variables.

As in Huntington-Klein et al. (2021), the teams adopted a host of different statistical techniques, data cleaning methods, and exact specifications. While everyone included “number of games” as one variable, just one other variable was included in more than half of the teams’ regression models. Unlike Huntington-Klein et al. (2021), in this study, there was also a much larger set of different statistical estimation techniques. The resulting estimates (with 95% confidence intervals) are below.

Is this good news or bad news? On the one hand, most of the estimates (reported as odds ratios, where 1 means no effect) lie between 1 and 1.5. On the other hand, about a third of the teams cannot rule out zero impact of skin tone on red cards; the other two thirds find a positive effect that is statistically significant at standard levels. In other words, if we picked two of these teams’ results at random and called one the “first result” and the other a “replication,” they would only agree whether the result is statistically significant or not about 55% of the time!

Let’s look at another. Breznau et al. (2021) get 73 teams, comprising 162 researchers, to answer the question “does immigration lower public support for social policies?” Again, each team was given the same data. This time, that consisted of responses to surveys about support for government social policies (example: “On the whole, do you think it should or should not be the government’s responsibility to provide a job for everyone who wants one?”), measures of immigration (at the country level), and various country-level explanatory variables such as GDP per capita and the Gini coefficient. The results spanned the spectrum of possible conclusions.

Slightly more than half of the results found no statistically significant link between immigration levels and support for policies—but a quarter found more immigration reduced support, and more than a sixth found more immigration increased support. If you picked two results at random, they would agree on the direction and statistical significance of the results less than half the time!
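Both agreement figures quoted above follow from the same arithmetic: if two results are drawn independently, the chance they land in the same category is the sum of the squared category shares. Here is a minimal sketch of that calculation. The red-card shares (one third, two thirds) come straight from the text; the immigration shares (58%, 25%, 17%) are illustrative assumptions picked only to match “slightly more than half”, “a quarter”, and “more than a sixth”, and the function name is hypothetical.

```python
from fractions import Fraction


def agreement_probability(shares):
    """Chance that two independently drawn results fall in the same category."""
    if sum(shares) != 1:
        raise ValueError("category shares must sum to 1")
    return sum(share * share for share in shares)


# Red-card study: about 1/3 not significant, about 2/3 significant (from the text).
red_card = agreement_probability([Fraction(1, 3), Fraction(2, 3)])

# Immigration study: ASSUMED shares, chosen only to be consistent with the prose.
immigration = agreement_probability(
    [Fraction(58, 100), Fraction(25, 100), Fraction(17, 100)]
)

print(f"red cards:   {float(red_card):.3f}")
print(f"immigration: {float(immigration):.3f}")
```

The first value is exactly 5/9, which is where “about 55%” comes from. With the assumed immigration shares, the second lands below one half, in line with the claim above.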

We could do more studies, but the general pattern is the same: when many teams answer the same question, beginning with the same dataset, it is quite common to find a wide spread of conclusions (even when you remove motivations related to beating publication bias).

At this point, it’s tempting to hope the different results stem from differing levels of expertise, or differing quality of analysis. “OK,” we might say, “different scientists will reach different conclusions, but maybe that’s because some scientists are bad at research. Good scientists will agree.” But as best as these papers can tell, that’s not a very big factor.

The study on soccer players tried to answer this in a few ways. First, the teams were split into two groups based on various measures of expertise (teaching classes on statistics, publishing on methodology, etc). The half with greater expertise was more likely to find a positive and statistically significant effect (78% of teams, instead of 68%), but the variability of their estimates was the same across the groups (just shifted in one direction or another). Second, the teams graded each other on the quality of their analysis plans (without seeing the results). But in this case, the quality of the analysis plan was unrelated to the outcome. This was the case even when they only looked at the grades given by experts in the statistical technique being used.

The immigration study also split its research teams into groups based on methodological expertise or topical expertise. In neither case did it have much of an impact on the kind of results discovered.

So, don’t assume the results of a given study are the definitive answer to the question. It’s quite likely that a different set of researchers, tackling the exact same question and starting with the exact same data, would have obtained a different result. Even if they had the same level of expertise!

Resist Science Nihilism!

But while most people probably overrate the degree of certainty in science, there also seems to be a sizable online contingent that has embraced the opposite conclusion. They know about the replication crisis and the unreliability of research, and have concluded the whole scientific operation is a scam. This goes too far in the opposite direction.

For example, a science nihilist might conclude that if expertise doesn’t drive the results above, then it must be that scientists simply find whatever they want to find, and that their results are designed to fabricate evidence for whatever they happen to believe already. But that doesn’t seem to be the case, at least in these multi-analyst studies. In both the study of soccer players and the one on immigration, participating researchers reported their beliefs before doing their analysis. In both cases there wasn’t a statistically significant correlation between prior beliefs and reported results.

If it’s not expertise and it’s not preconceived beliefs that drive results, what is it? I think it really is simply that research is hard and different defensible decisions can lead to different outcomes. Huntington-Klein et al. (2021) perform an interesting exercise where they apply the same analysis to different teams’ data, or alternatively, apply different analysis plans to the same dataset. That exercise suggests roughly half of the divergence in the teams’ conclusions stems from different decisions made in the dataset construction stage and half from different decisions made about analysis. There’s no silver bullet, just a lot of little decisions that add up.
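To make the logic of that exercise concrete, here is a rough sketch of one way to cross datasets with analysis plans and compare the two sources of spread. It is not the procedure used in Huntington-Klein et al. (2021): the grid of estimates is invented purely for illustration, and the function names and the simple variance comparison are my own assumptions.

```python
import statistics

# HYPOTHETICAL estimates: grid[d][a] is the effect found by running
# analysis plan `a` on the dataset built by team `d`. Numbers are invented.
grid = [
    [-0.02, 0.01, 0.03],
    [-0.05, -0.02, 0.00],
    [0.01, 0.04, 0.06],
]


def spread_from_data(grid):
    """Average variance across datasets, holding the analysis plan fixed."""
    columns = list(zip(*grid))
    return statistics.mean(statistics.pvariance(col) for col in columns)


def spread_from_analysis(grid):
    """Average variance across analysis plans, holding the dataset fixed."""
    return statistics.mean(statistics.pvariance(row) for row in grid)


data_part = spread_from_data(grid)
analysis_part = spread_from_analysis(grid)
data_share = data_part / (data_part + analysis_part)

print(f"share of spread from dataset construction: {data_share:.2f}")
print(f"share of spread from analysis choices:     {1 - data_share:.2f}")
```

A share near one half for each part would correspond to the “roughly half and half” split described above; a real decomposition would also need to account for interactions between data and analysis choices.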

More importantly, while it’s true that any scientific study should not be viewed as the last word on anything, studies still do give us signals about what might be true. And the signals add up.

Looking at the above results, while I am not certain of anything, I come away thinking it’s slightly more likely than not that compulsory schooling reduces teenage pregnancy, pretty likely that dark-skinned soccer players get more red cards, and that there is no simple meaningful relationship between immigration and views on government social policy. Given that most of the decisions are defensible, I go with the results that show up more often than not.

And sometimes, the results are pretty compelling. Earlier, I mentioned that Huntington-Klein et al. (2021) actually investigated two hypotheses. In the second, Huntington-Klein et al. (2021) ask researchers to look at the effect of employer-provided healthcare on entrepreneurship. The key identifying assumption is that in the US, people become eligible for publicly provided health insurance (Medicare) at age 65. But people’s personalities and opportunities tend to change more slowly and idiosyncratically; they don’t suddenly change on a person’s 65th birthday. So the study looks at how rates of entrepreneurship compare between groups just older than the 65 threshold and those just under it. Again, researchers have to build a dataset from publicly available data. Again, every team made different decisions, such that none of the datasets are exactly alike. Again, researchers must decide exactly how to test the hypothesis, and again they choose slight variations in how to test it. But this time, at least the estimated effects line up reasonably well.

I think this is pretty compelling evidence that there’s something really going on here—at least for the time and place under study.

And it isn’t necessary to have teams of researchers generate the above kinds of figures. “Multiverse analysis” asks researchers to explicitly consider how their results change under all plausible changes to the data and analysis; essentially, it asks individual teams to try and behave like a set of teams. In economics (and I’m sure in many other fields—I’m just writing about what I know here), something like this is supposedly done in the “robustness checks” section of a paper. In this part of a study, the researchers show how their results are or are not robust to alternative data and analysis decisions. The trouble has long been that robustness checks have been selective rather than systematic; the fear is that researchers highlight only the robustness checks that make their core conclusion look good and bury the rest.
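As a rough illustration of what a systematic (rather than selective) set of robustness checks can look like, here is a minimal multiverse sketch. Everything in it is an assumption made for the example: the data are synthetic, the three “defensible choices” are made up, and the estimator is a plain difference in means rather than anything a real paper would rely on.

```python
import itertools
import random
import statistics

random.seed(0)

# SYNTHETIC data: each row is (treated, age, outcome), with a built-in effect of 1.0.
rows = []
for _ in range(2000):
    treated = random.random() < 0.5
    age = random.randint(13, 19)
    effect = 1.0 if treated else 0.0
    rows.append((treated, age, effect + 0.3 * age + random.gauss(0, 2)))

# Hypothetical judgement calls, each with more than one defensible option.
choices = {
    "min_age": [13, 14],
    "max_age": [18, 19],
    "trim_outliers": [False, True],
}


def estimate(rows, min_age, max_age, trim_outliers):
    """Difference in mean outcome (treated minus control) under one specification."""
    sample = [r for r in rows if min_age <= r[1] <= max_age]
    if trim_outliers:
        outcomes = sorted(r[2] for r in sample)
        low = outcomes[int(0.01 * len(outcomes))]
        high = outcomes[int(0.99 * len(outcomes))]
        sample = [r for r in sample if low <= r[2] <= high]
    treated = [r[2] for r in sample if r[0]]
    control = [r[2] for r in sample if not r[0]]
    return statistics.mean(treated) - statistics.mean(control)


# Run every combination of choices instead of a hand-picked few.
multiverse = []
for combo in itertools.product(*choices.values()):
    spec = dict(zip(choices.keys(), combo))
    multiverse.append((spec, estimate(rows, **spec)))

for spec, effect in sorted(multiverse, key=lambda item: item[1]):
    print(f"{effect:+.3f}  {spec}")
```

The point of the design is that the list of specifications is fixed before any results are seen, so every estimate gets reported, including the unflattering ones.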

But I wonder if this is changing. The robustness checks section of economics papers has been steadily ballooning over time, contributing to the novella-like length of many modern economics papers (the average length rose from 15 pages to 45 pages between 1970 and 2012). Some papers are now beginning to include figures like the following, which show how the core results change when assumptions change and which closely mirror the results generated by multiple-analyst papers. Notably, this figure includes many sets of assumptions that show results that are not statistically different from zero (the authors aren’t hiding everything).

Economists complain about how difficult these requirements make the publication process (and how unpleasant they make it to read papers), but the multiple-analyst work suggests it’s probably still a good idea, at least until our “methodological technology” catches up so that you don’t have a big spread of results when you make different defensible decisions.

More broadly, I take away three things from this literature:

  1. Failures to replicate are to be expected, given the state of our methodological technology, even in the best circumstances, even if there’s no publication bias.

  2. Form your ideas based on suites of papers, or entire literatures, not primarily on individual studies.

  3. There is plenty of randomness in the research process for publication bias to exploit. More on that in the future.