This is a bit complicated, but to start, we can only answer this question for the kinds of questions we have empirical data about from superforecasters. That's because the fact that superforecasters do better is an empirical observation, not a clear predictive or quantitative theory about what makes people better or worse. I'm going to use data from the AIImpacts blog post - https://aiimpacts.org/evidence-on-good-forecasting-practices-from-the-good-judgment-project-an-accompanying-blog-post/ - because I don't have the book or the datasets handy right now.
The original tournament was about short- and medium-term geopolitical and similar questions. The scoring used time-weighted Brier scores, and note that Brier scores themselves are specific to the question set. For these questions, an aggregate of superforecaster predictions had roughly 60-70% lower Brier scores than the control group of "regular" forecasters. The best individual superforecaster had a score of 0.14, while the no-skill Brier score on these questions, i.e. what you get by assigning equal probability to every outcome, is 0.53. But that's not the right comparison if we're comparing superforecasting to ordinary forecasters. The average across forecasters (which seems to include the superforecasters) was close to 0.35. If we adjust that up to 0.4 to roughly remove the superforecasters, a score 65% lower than that is 0.14, the same score as the best individual superforecaster.
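To make that arithmetic concrete, here's a minimal sketch in Python; the 0.40 control-group score and the 65% improvement are the approximate figures from above, not exact values from the dataset:

```python
# Rough arithmetic behind the comparison above. The 0.40 control-group score
# (regular forecasters with superforecasters roughly removed) and the 65%
# improvement are approximate figures from the discussion, not exact data.
control_brier = 0.40
improvement = 0.65

aggregate_brier = control_brier * (1 - improvement)
print(f"{aggregate_brier:.2f}")  # 0.14 -- the same as the best individual superforecaster
```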
How is that possible? Aggregation. And the benefit of aggregation isn't due to the individual skill of superforecasters; it comes from the law of large numbers. So maybe we don't want to give the superforecasters credit for that part of the edge, but superforecasting as a practice does include it.
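As a toy illustration of that aggregation effect (a simulation under made-up assumptions, not the GJP data): give every forecaster the true probability plus independent noise, and the averaged forecast scores better than the typical individual, simply because the noise averages out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_questions, n_forecasters = 2000, 100

# Hypothetical setup: each question has some true probability, and each
# forecaster reports that probability plus independent noise.
true_p = rng.uniform(0.05, 0.95, n_questions)
outcomes = rng.binomial(1, true_p)
forecasts = np.clip(true_p + rng.normal(0, 0.25, (n_forecasters, n_questions)), 0.01, 0.99)

def brier(p, y):
    # Single-outcome convention: mean squared distance from the 0/1 result.
    return np.mean((p - y) ** 2)

avg_individual = np.mean([brier(f, outcomes) for f in forecasts])
aggregated = brier(forecasts.mean(axis=0), outcomes)

print(f"average individual Brier:   {avg_individual:.3f}")
print(f"Brier of averaged forecast: {aggregated:.3f}")  # lower -- the noise cancels out
```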
So how do we understand a Brier score? It's the average squared distance between your prediction and what actually happened, i.e. 1 or 0. The square root of 0.14 is about 0.37, so a Brier score of 0.14 roughly means that, on average, you said things that did happen were about 63% likely and things that didn't were about 37% likely. But remember these were time-weighted scores: the squared error is computed each day and then averaged over the life of the question. So if someone predicted 50% on day 1, moved steadily down to 20% by the time the question closed, and it resolved negatively, their average prediction was 35% and their time-weighted Brier score works out to roughly 0.13, essentially the same as that 0.14 above.
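Here is that worked example as a short sketch, using the same single-outcome squared-error convention as above and assuming, purely for illustration, a 100-day question:

```python
import numpy as np

# A forecast that falls linearly from 50% on day 1 to 20% at question close,
# on a question that resolves negatively (outcome = 0).
forecast = np.linspace(0.50, 0.20, 100)   # one value per day, 100 days assumed
outcome = 0.0

# Time-weighting: score each day's forecast, then average over the days.
daily_scores = (forecast - outcome) ** 2
print(f"time-weighted Brier: {daily_scores.mean():.3f}")  # ~0.13
```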