Previously (Marginal Revolution): Gambling Can Save Science
A study attempted to replicate 21 studies published in Science and Nature.
Beforehand, prediction markets were used to elicit, for each study, the probability that it would replicate. The results were as follows (from the original paper):
Fig. 4: Prediction market and survey beliefs.
The prediction market beliefs and the survey beliefs of replicating (from treatment 2 for measuring beliefs; see the Supplementary Methods for details and Supplementary Fig. 6 for the results from treatment 1) are shown. The replication studies are ranked in terms of prediction market beliefs on the x axis, with replication studies more likely to replicate than not to the right of the dashed line. The mean prediction market belief of replication is 63.4% (range: 23.1–95.5%, 95% CI = 53.7–73.0%) and the mean survey belief is 60.6% (range: 27.8–81.5%, 95% CI = 53.0–68.2%). This is similar to the actual replication rate of 61.9%. The prediction market beliefs and survey beliefs are highly correlated, but imprecisely estimated (Spearman correlation coefficient: 0.845, 95% CI = 0.652–0.936, P < 0.001, n = 21). Both the prediction market beliefs (Spearman correlation coefficient: 0.842, 95% CI = 0.645–0.934, P < 0.001, n = 21) and the survey beliefs (Spearman correlation coefficient: 0.761, 95% CI = 0.491–0.898, P < 0.001, n = 21) are also highly correlated with a successful replication.
That is not only a super impressive result. That result is suspiciously amazingly great.
The mean prediction market belief of replication was 63.4%, the survey mean was 60.6%, and the actual replication rate was 61.9%. That’s impressive all around.
What’s far more striking is that they knew exactly which studies would replicate. Every study that would replicate traded at a higher probability of success than every study that would fail to replicate.
Combining that with an almost exactly correct mean success rate, we have a stunning display of knowledge and of under-confidence.
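To make those two claims concrete, here is a minimal sketch of the check one could run on the published data. The belief and outcome arrays below are hypothetical placeholders purely for illustration, not the paper’s actual numbers.

```python
# Minimal sketch of the two checks described above: perfect separation of
# replicators from non-replicators by market price, and calibration of the
# mean price against the realized replication rate.
# The numbers below are illustrative placeholders, NOT the paper's data.

market_beliefs = [0.23, 0.31, 0.38, 0.44, 0.47, 0.52, 0.58, 0.60,  # hypothetical prices
                  0.63, 0.66, 0.70, 0.72, 0.75, 0.78, 0.81, 0.84,
                  0.87, 0.90, 0.92, 0.94, 0.955]
replicated = [False] * 8 + [True] * 13  # 8 failures, 13 successes, as in the paper

# Perfect separation: the lowest price among studies that replicated exceeds
# the highest price among studies that did not.
lowest_success = min(p for p, r in zip(market_beliefs, replicated) if r)
highest_failure = max(p for p, r in zip(market_beliefs, replicated) if not r)
print("perfectly sorted:", lowest_success > highest_failure)

# Calibration of the mean: average market price vs. realized replication rate.
mean_belief = sum(market_beliefs) / len(market_beliefs)
replication_rate = sum(replicated) / len(replicated)
print(f"mean belief {mean_belief:.1%} vs. replication rate {replication_rate:.1%}")
```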
Then we combine that with this fact from the paper:
Second, among the unsuccessful replications, there was essentially no evidence for the original finding. The average relative effect size was very close to zero for the eight findings that failed to replicate according to the statistical significance criterion.
That means there was a clean cut. Thirteen of the studies successfully replicated. The other eight not only didn’t replicate, but showed very close to no effect.
Now combine these facts: The rate of replication was estimated correctly. The studies were exactly correctly sorted by whether they would replicate. None of the studies that failed to replicate came close to replicating, so there was a ‘clean cut’ in the underlying scientific reality. Some of the studies found real results. All others were either fraud, p-hacking or the light p-hacking of a bad hypothesis and small sample size. No in between.
The implementation of the prediction market used a market maker that began anchored to a 50% probability of replication. This, together with the fact that participants had limited tokens with which to trade (and thus had to prioritize which probabilities to move), explains some of the under-confidence in the individual results. The rest seems to be legitimate under-confidence.
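Below is a hedged sketch of that mechanism. It assumes a logarithmic market scoring rule (LMSR) market maker, which is a standard implementation choice but not something the post confirms; the liquidity parameter and token budget are made-up values, chosen only to show how a finite budget leaves the price short of a confident trader’s belief.

```python
import math

# Hedged sketch: a market maker anchored at a 50% price plus a limited token
# budget can produce underconfident prices. An LMSR market maker is assumed;
# the post only says the market started at 50%, so treat this as illustrative.

def lmsr_price(q_yes: float, q_no: float, b: float) -> float:
    """Current price of the YES ('will replicate') share under LMSR."""
    return math.exp(q_yes / b) / (math.exp(q_yes / b) + math.exp(q_no / b))

def lmsr_cost(q_yes: float, q_no: float, b: float) -> float:
    """LMSR cost function; a trade costs the change in this quantity."""
    return b * math.log(math.exp(q_yes / b) + math.exp(q_no / b))

b = 100.0           # liquidity parameter (assumed); larger b = harder to move the price
budget = 20.0       # tokens a trader is willing to spend on this one question (assumed)
q_yes = q_no = 0.0  # symmetric start => price = 0.50, the ignorance prior

# A trader who is sure the study will replicate buys YES shares until the
# budget set aside for this question runs out.
spent, step = 0.0, 1.0
while spent < budget:
    cost = lmsr_cost(q_yes + step, q_no, b) - lmsr_cost(q_yes, q_no, b)
    if spent + cost > budget:
        break
    q_yes += step
    spent += cost

print(f"price after spending {spent:.1f} tokens: {lmsr_price(q_yes, q_no, b):.1%}")
# With a finite budget the price stops well short of the trader's true belief,
# so individual prices stay pulled toward the 50% anchor even when traders "know".
```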
What we have here is an example of that elusive object, the unknown known: Things we don’t know that we know. This completes Rumsfeld’s 2×2. We pretend that we don’t have enough information to know which studies represent real results and which ones don’t. We are modest. We don’t fully update on information that doesn’t conform properly to the formal rules of inference, or the norms of scientific debate. We don’t dare make the claim that we know, even to ourselves.
And yet, we know.
What else do we know?
This is a very interesting post that seems to be a clean example of a really important problem. If it’s true, I expect it will be an important building block in my model of the world.
However, I feel confused about it. For example, the participants had limited tokens and the ignorance prior was set before they traded, which appears to have induced underconfidence by default, and it’s not clear to me whether this entire effect is explained by that. Also the blue diamonds aren’t actually a great predictor of the blue circles and I don’t know why that would happen.
So I’m nominating this for review. If people review it in detail and find it’s valid, then I think it’s very important, but they might not, and that’s also valuable work.
I... had totally forgotten what the actual content of this post was. (I looked at it while pondering things to nominate, vaguely remembered some anecdote that led up to ‘and therefore, unknown knowns exist’, and thought ‘well, it might be important that unknown knowns exist, but I haven’t used that in the past year, so probably shouldn’t nominate it.’)
But, yeah, the meat of this post seems incredibly important-if-true.
Second Bena’s nomination