Some heuristics I use for deciding how much I trust scientific results

I’ve done nothing to test these heuristics and have no empirical evidence for how well they work for forecasting replications or anything else. I’m going to write them anyway. The heuristics I’m listing are roughly in order of how important I think they are. My training is as an economist (although I have substantial exposure to political science) and lots of this is going to be written from an econometrics perspective.

How much does the result rely on experimental evidence vs causal inference from observational evidence?

I basically believe without question every result that mainstream chemists and condensed matter physicists say is true. I think a big part of this is that in these fields it’s really easy to test hypotheses experimentally, and to build experiments that distinguish really precisely between competing hypotheses. This seems great.

On the other hand, when relying on observational evidence to get reliable causal inference you have to control for confounders while not controlling for colliders. This is really hard! It generally requires finding a natural experiment that introduces randomisation or having very good reason to think that you’ve controlled for all confounders.
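To make the confounder/collider distinction concrete, here’s a minimal simulation sketch. Everything in it is made up for illustration: the data-generating processes, the true effect of 1, and the helper names (`slope`, `residuals`, `slope_controlling_for`) are all assumptions, not anything from a real study. It regresses y on x with and without a control, once where the control is a confounder and once where it’s a collider.

```python
import random

def slope(y, x):
    """OLS slope from regressing y on x (with an intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def residuals(y, x):
    """Residuals from regressing y on x (with an intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = slope(y, x)
    return [(yi - mean_y) - b * (xi - mean_x) for xi, yi in zip(x, y)]

def slope_controlling_for(y, x, control):
    """Coefficient on x when regressing y on x and control.

    Uses the Frisch-Waugh-Lovell trick: partial the control out of
    both y and x, then regress residual on residual.
    """
    return slope(residuals(y, control), residuals(x, control))

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 20_000
TRUE_EFFECT = 1.0        # assumed causal effect of x on y in both cases

# Case 1: z is a confounder (z causes both x and y).
z = [rng.gauss(0, 1) for _ in range(n)]
x1 = [zi + rng.gauss(0, 1) for zi in z]
y1 = [TRUE_EFFECT * xi + 2 * zi + rng.gauss(0, 1) for xi, zi in zip(x1, z)]
confounder_naive = slope(y1, x1)
confounder_adjusted = slope_controlling_for(y1, x1, z)

# Case 2: c is a collider (x and y both cause c).
x2 = [rng.gauss(0, 1) for _ in range(n)]
y2 = [TRUE_EFFECT * xi + rng.gauss(0, 1) for xi in x2]
c = [xi + yi + rng.gauss(0, 1) for xi, yi in zip(x2, y2)]
collider_naive = slope(y2, x2)
collider_adjusted = slope_controlling_for(y2, x2, c)

print(f"confounder: naive={confounder_naive:.2f}, adjusted={confounder_adjusted:.2f}")
print(f"collider:   naive={collider_naive:.2f}, adjusted={collider_adjusted:.2f}")
```

With these particular made-up numbers, the naive estimate in the confounder case should come out near 2 and the adjusted one near the true effect of 1. In the collider case it’s the other way round: the naive estimate is near 1 and the “controlled” estimate is near 0. The same act of adding a control fixes one regression and breaks the other, and nothing in the data alone tells you which case you’re in.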

We also make quite big updates on which methods effectively do this. For instance, until last year we thought that two-way fixed effects did a pretty good job of this, before we realised that heterogeneous treatment effects are actually a really big deal for two-way fixed effects estimators.
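As a toy illustration of the two-way fixed effects problem, here’s a sketch with a tiny, noise-free panel. The setup is entirely hypothetical: two units that adopt treatment at different times, and an assumed treatment effect that grows by 1 for each period a unit has been treated. The function names are invented for this example, and the estimator is just the textbook double-demeaned regression for a balanced panel.

```python
def double_demean(panel):
    """Subtract unit means and period means, then add back the grand mean."""
    n_units, n_periods = len(panel), len(panel[0])
    unit_means = [sum(row) / n_periods for row in panel]
    period_means = [sum(panel[i][t] for i in range(n_units)) / n_units
                    for t in range(n_periods)]
    grand_mean = sum(unit_means) / n_units
    return [[panel[i][t] - unit_means[i] - period_means[t] + grand_mean
             for t in range(n_periods)]
            for i in range(n_units)]

def twfe_coefficient(y, d):
    """Coefficient on d from regressing y on d plus unit and period fixed
    effects, for a balanced panel (rows are units, columns are periods)."""
    d_tilde = double_demean(d)
    y_tilde = double_demean(y)
    numerator = sum(dt * yt
                    for d_row, y_row in zip(d_tilde, y_tilde)
                    for dt, yt in zip(d_row, y_row))
    denominator = sum(dt ** 2 for d_row in d_tilde for dt in d_row)
    return numerator / denominator

N_PERIODS = 6
START_PERIODS = [2, 4]  # hypothetical early adopter and late adopter

# Treatment indicator: 1 from the unit's start period onwards.
d = [[1.0 if t >= start else 0.0 for t in range(N_PERIODS)]
     for start in START_PERIODS]

# Assumed heterogeneous effect: 1 in the first treated period, then 2, 3, ...
effect = [[float(t - start + 1) if t >= start else 0.0 for t in range(N_PERIODS)]
          for start in START_PERIODS]

# Outcome is just the treatment effect: no noise, no unit or period effects.
y = effect

treated_effects = [e for e_row, d_row in zip(effect, d)
                   for e, treated in zip(e_row, d_row) if treated == 1.0]
true_att = sum(treated_effects) / len(treated_effects)
twfe = twfe_coefficient(y, d)

print(f"average effect on treated observations: {true_att:.3f}")
print(f"two-way fixed effects estimate:         {twfe:.3f}")
```

In this toy panel the average effect across treated observations is 13/6 (about 2.17), but the two-way fixed effects coefficient comes out at 0.5, even though there’s no noise at all. Roughly, the early adopter ends up acting as a control for the late adopter while its own effect is still growing, and that growth gets subtracted off.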

What’s more, in areas that use primarily observational data there’s a really big gap between fields in how often papers even try to use causal inference methods and how hard they work to show that their identifying assumptions hold. I generally think that modern microeconomics papers are the best on this and nutrition science the worst.

I’m slightly oversimplifying by using a strict division between experimental and observational data. All data is observational, and what matters is how credibly you think you’ve observed what would happen counterfactually without some change. But in practice, this is much easier in settings where we think that we can change the thing we’re interested in without other things changing.

There are some difficult questions around scientific realism here that I’m going to ignore because I’m mostly interested in how much we can trust a result in typical use cases. The notable area where I think this actually bites is thinking about the implications of basic physics for longtermism where it does seem like basic physics actually changes quite a lot over time with important implications for questions like how large we expect the future to be.

Are there practitioners using this result and how strong is the selection pressure on the result?

If a result is relied on a lot and there would be easily noticeable and punishable consequences if it were wrong, I’m way more likely to believe that the result is at least roughly right.

For instance, this means I’m actually really confident that important results in auction design hold. Auction design is used all the time by both government and private sector actors in ways that earn these actors billions of dollars and, in the private sector case at least, are iterated on regularly.

Auction theory is an interesting case because it comes out of pretty abstract microeconomic theory and wasn’t really developed from laboratory experiments, but I’m still pretty confident in it because it’s so widely used by practitioners and is subject to strong selection pressure.

On the other hand, I’m much less confident in lots of political science research. It seems like places like hedge funds don’t use it that much to predict market outcomes, it doesn’t seem to be used by governments that much, and it’s really hard to know how counterfactually important, say, World Bank programs that use political science were.

How large is the literature that supports the result and how many techniques have been used to support it?

I trust a result more when a large literature supports it and when several different techniques point the same way. This view actually does have some empirical support. There’s this nice paper where a load of different researchers were given the same (I think simulated) data, and the paper looked at how their results differed. It found that there was quite a lot of difference between what researchers found, based on things like their coding choices and what statistical techniques they used, but that when there was a real effect the average researcher found an effect of the right sign and roughly the right magnitude, and when there was no real effect the average researcher found roughly no effect. I’m afraid I can’t find the paper, and I can’t be bothered to link to the Noah Smith or Matt Clancy blog posts on it.

Mostly though I use this heuristic because it seems pretty sensible.

External validity

External validity is how likely it is that a result generalises from whatever the study setting was to the setting in which the result is used.

I think this is a really big deal for lots of RCT-based development economics. We just see really quite often that results that seem to consistently hold when tested with RCTs don’t hold when scaled up.

I’m more sceptical of the external validity of a result the more intensive the intervention is and so the more buy-in and effort is needed from participants and researchers. Seems pretty likely that when the intervention is used it won’t have as much effort put into it. I’m particularly sceptical if the intervention is complex or precise.

Results given statistical power

Statistical power is the probability that a test finds a statistically significant effect, given the true effect size and the sample size. If the statistical power of a test is low but a significant result is found, it’s likely that the researcher just got lucky and the true effect size is much smaller and/or of the opposite sign.

The intuition for this is that if a statistical test is underpowered (say, for this example, its power is under 50%) then it’s unlikely that a statistically significant effect is found.

If a statistically significant effect is found anyway then something weird must have happened, like the specific sample that was used happening by chance to show really large effects. If you have a small sample size (and so relatively high variance) and are very unwilling to accept mistakes in the direction of finding effects that aren’t there, you need a really large estimated effect size to be confident that there’s any effect at all! This estimated effect size has to be larger than the true effect size because, by assumption, your test is unlikely to detect an effect given the true distribution of the variable in question. This is what it means for a test to have low power.
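Here’s a minimal simulation sketch of that intuition. The numbers are assumptions chosen for illustration: a true effect of 0.1, a standard error of 0.2 for every study, and a two-sided test at the 5% level. It runs the same underpowered study many times and looks only at the runs that came out significant.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable

TRUE_EFFECT = 0.1        # assumed true effect size
STANDARD_ERROR = 0.2     # assumed standard error of each study's estimate
CRITICAL_Z = 1.96        # two-sided 5% significance threshold
N_STUDIES = 20_000

# Each study's estimate is the true effect plus sampling noise.
estimates = [rng.gauss(TRUE_EFFECT, STANDARD_ERROR) for _ in range(N_STUDIES)]

# Keep only the studies that would be declared statistically significant.
significant = [e for e in estimates if abs(e) > CRITICAL_Z * STANDARD_ERROR]

# Share of studies that reach significance: an estimate of the test's power.
power = len(significant) / N_STUDIES

# How much bigger the significant estimates are, on average, than the truth.
mean_abs_significant = sum(abs(e) for e in significant) / len(significant)
exaggeration = mean_abs_significant / TRUE_EFFECT

# Share of significant estimates that have the wrong sign.
wrong_sign_share = sum(1 for e in significant if e < 0) / len(significant)

print(f"power: {power:.3f}")
print(f"average exaggeration among significant results: {exaggeration:.1f}x")
print(f"share of significant results with the wrong sign: {wrong_sign_share:.3f}")
```

Under these assumptions only a small share of studies reach significance. Every one that does has to have an estimate bigger in magnitude than 1.96 × 0.2 = 0.392, so the significant results overstate the true effect of 0.1 by roughly a factor of four at the very least. Some of them also point in the wrong direction.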

More sinisterly, it could also imply some selection effect for which results are observed, like publication bias or the methods the researchers used.

I want to caveat this section by saying that I don’t have a very good intuition for power calculations and how much they actually affect how likely results are to replicate.

How strong is the social desirability bias at play?

This seems somewhat important, but I think it’s often overplayed in the EA and rationality communities. In practice it does mean that I think I’m less likely to see papers that find, say, that child poverty has no effect on future outcomes. My vibe is that psychology seems particularly bad for this for some reason?

But also I see papers that find socially undesirable results all the time!

For instance, this paper finds negative effects of democracy on state capacity for places with middling levels of democracy, this paper finds higher levels of interest in reading amongst preschool-age girls, and this paper finds no association between youth unemployment and crime. It’s really easy to find these papers! You just search for them on Google Scholar.

Have there been formal tests of publication bias?

We can test whether the distribution of results on a specific question looks like it should if whether a paper gets published were independent of the sign and magnitude of its results. I’m a lot less confident in a field if these tests consistently find publication bias.
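One simple version of this kind of check, sometimes called a caliper test, compares how many reported test statistics fall just above the significance threshold with how many fall just below it. Here’s a minimal sketch on simulated data. The distribution of z-statistics, the window width of 0.2, and the rule that insignificant results get published only 20% of the time are all assumptions made up for the example, and `share_just_above` is a hypothetical helper, not a function from any library.

```python
import random

def share_just_above(z_stats, threshold=1.96, width=0.2):
    """Among |z| values within `width` of the threshold, the share that
    land on the significant side."""
    above = sum(1 for z in z_stats if threshold <= abs(z) < threshold + width)
    below = sum(1 for z in z_stats if threshold - width <= abs(z) < threshold)
    return above / (above + below)

rng = random.Random(0)   # fixed seed so the run is repeatable

N_STUDIES = 50_000
PUBLISH_PROB_IF_INSIGNIFICANT = 0.2   # assumed strength of the selection

# Every study that gets run, whether or not anyone ever sees it.
all_z = [rng.gauss(1.0, 1.0) for _ in range(N_STUDIES)]

# Significant results are always published; the rest only sometimes.
published_z = [z for z in all_z
               if abs(z) >= 1.96 or rng.random() < PUBLISH_PROB_IF_INSIGNIFICANT]

share_all = share_just_above(all_z)
share_published = share_just_above(published_z)

print(f"share just above the threshold, all studies:       {share_all:.2f}")
print(f"share just above the threshold, published studies: {share_published:.2f}")
```

Without selection, results just above and just below the threshold should be about equally common, because nothing special happens to the underlying distribution at 1.96. In the simulated published literature the results pile up just above the threshold instead. A real test would also ask whether the imbalance is bigger than chance would produce (for instance with a binomial test), which this sketch leaves out.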