A quick and crude comparison of epidemiological expert forecasts versus Metaculus forecasts for COVID-19
Katherine Milkman on Twitter notes how far off the epidemiological expert forecasts were in the linked sample:
https://twitter.com/katy_milkman/status/1244668082062348291
They gave an average estimate of 20,000 cases. The actual outcome was 122,653 cases in the U.S. by the stated date. That's off by a factor of 6.13, and these were experts. Only 3 out of 18 survey respondents managed to get the actual outcome to fall within their 80% confidence interval. If they were perfectly calibrated on this one-off prediction, about 14 should've had the actual outcome fall within their 80% confidence interval. EDIT: No, that's not right; Daniel Filan points out below that we wouldn't expect 14 of 18 to get it in their range, because errors between forecasters on a one-off forecast are often correlated.
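For concreteness, here is the arithmetic behind those two figures (a minimal Python sketch; the only inputs are the numbers quoted above):

```python
# Quick sanity check of the numbers above.
expert_avg = 20_000        # average expert estimate of U.S. cases
actual = 122_653           # actual U.S. case count by the stated date

print(actual / expert_avg)  # ~6.13, the factor the experts were off by

# The naive expectation of ~14 of 18 respondents covering the outcome is just
# 0.8 * 18, and it assumes each forecaster's error is independent -- which,
# as the EDIT notes, is not a safe assumption for a one-off question.
n_respondents = 18
print(0.8 * n_respondents)  # 14.4
```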
I was curious how this compares to the Metaculus community forecast (note: not the machine-learning one, just the simple median of user predictions). Unfortunately the interface doesn't show me the full distribution at a given date; it just says what the median was at the time. If the experts' central tendency was off by a factor of 6.13, how far off was Metaculus?
I looked into it in this document:
Sadly a direct comparison isn't really feasible, since we weren't predicting the same questions. But suppose all predictions of importance were entered into platforms such as the Good Judgement Project Open or Metaculus. Then comparisons between groups could be trivial and continuous. This isn't even "experts versus non-experts". The relevant comparison is at the platform level: "untrackable and unworkable one-off PDFs of somebody's projections" versus proper scoring and aggregation over time. Since Metaculus accounts can be entirely anonymous, why wouldn't we want every expert to enter their forecasts into a track record? That would make it possible to find out whether a given person is a dart-throwing chimp. You should assume half of them are.
Nope. Suppose I roll a 100-sided die, and all LessWrongers write down their centred 80% credible interval for where the result will fall. If the LWers are rational and calibrated, that interval should be [11, 90]. So the actual outcome will fall in everybody's credible interval or in nobody's. The relevant averaging should happen across questions, not across predictors.
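A small simulation (my sketch, not part of the original comment) illustrates the point: when every calibrated forecaster reports the same interval, per-question coverage is all-or-nothing, but coverage averaged across many questions still converges to 80%.

```python
import random

random.seed(0)
INTERVAL = range(11, 91)   # the shared centred 80% interval on a d100: {11, ..., 90}
N_FORECASTERS = 18
N_QUESTIONS = 10_000

# One-off question: every forecaster holds the same interval, so either
# all 18 contain the outcome or none do -- never something like 14 of 18.
roll = random.randint(1, 100)
hits = sum(roll in INTERVAL for _ in range(N_FORECASTERS))
print(f"one-off roll {roll}: {hits}/{N_FORECASTERS} forecasters contain it")

# Averaging across questions instead: coverage converges to ~80%.
covered = sum(random.randint(1, 100) in INTERVAL for _ in range(N_QUESTIONS))
print(f"coverage across questions: {covered / N_QUESTIONS:.3f}")  # ≈ 0.80
```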
So I thought about it, and I think you're correct. At least, if their errors were correlated and nonrandom (which is very natural in many situations), then we wouldn't expect 14 of 18 in the group to get the outcome in their range. I can imagine hypotheticals where "group calibration in one-offs" might mean something, but not here, now that I think about it.
Instead, I should've just stuck to pointing out how far outside their ranges many of them were, rather than the proportion of the group that got the outcome in their range. E.g. suppose an anonymous forecaster places a 0.0000001% probability on some natural macro-scale event having outcome A instead of B on the next observation, and outcome A happens. That is strong evidence that they weren't just "incorrect with a lot of error"; they're probably really uncalibrated too.
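To put a rough number on "strong evidence" (a sketch of mine, assuming we compare the stated probability against a maximally ignorant 50/50 benchmark):

```python
import math

p_stated = 1e-9    # the 0.0000001% the hypothetical forecaster put on outcome A
p_benchmark = 0.5  # assumed lazy benchmark: just assign 50% to A

# Likelihood ratio (benchmark vs. stated) once A actually happens, expressed
# in bits of evidence against taking the stated number at face value.
print(math.log2(p_benchmark / p_stated))  # ≈ 29 bits
```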
Eyeballing the survey chart, many of those 80% CIs are so narrow that they imply the true figure was a many-sigma event. That's really incredible, and I would take it as evidence that they're also uncalibrated (not just wrong in this one-off). That's a much better way to infer miscalibration than the 14-of-18 argument.
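As an illustration of what "many-sigma" means here (with made-up interval numbers, since I'm only eyeballing the chart): if a respondent's 80% CI were, say, 15,000–25,000 cases, and we model their implied belief as a normal distribution, the actual outcome sits tens of standard deviations out.

```python
# Hypothetical 80% CI, chosen only for illustration -- not a real survey response.
lo, hi = 15_000, 25_000
actual = 122_653

Z80 = 1.2816                      # standard normal: P(|Z| < 1.2816) ≈ 0.80
sigma = (hi - lo) / (2 * Z80)     # implied standard deviation ≈ 3,900
midpoint = (lo + hi) / 2
print((actual - midpoint) / sigma)  # ≈ 26 sigma
```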
Semi-related: I'm unclear about a detail in your dice example. You say a rational and calibrated interval "should be [11, 90]" for the die roll. Why? A calibrated-over-time 80% confidence interval on such a die could be placed anywhere (e.g. [1, 80] or [21, 100]), so long as it covers 80 of the 100 faces.
Yes, but the calibrated and centred interval is uniquely [11, 90].
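A quick check of that exchange (assuming a die with faces 1–100): many 80-face intervals exist, but only one is symmetric about the die's median.

```python
# All intervals [a, a+79] on a d100 cover exactly 80 of the 100 faces...
candidates = [(a, a + 79) for a in range(1, 22)]   # [1,80], [2,81], ..., [21,100]
# ...but only [11, 90] is centred on the die's median of 50.5.
centred = [(a, b) for a, b in candidates if (a + b) / 2 == 50.5]
print(len(candidates), centred)  # 21 [(11, 90)]
```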