tl;dr—I don’t believe the Metaculus prediction is materially better than the community median.
> Another example is Metaculus Prediction, an ML algorithm that calibrates and weights each forecaster’s prediction after training on forecaster-level predictions and track records. From 2015 to 2021, it outperformed the median forecast in the Metaculus community by 24% on binary questions and by 9% on continuous questions.
This is at best a misleading way of describing the performance of the Metaculus prediction vs the community (median) prediction.
We can slice the data in any number of ways, and I can’t find any slice on which the Metaculus prediction outperformed the median prediction by 24%.
Looking at the all-time data (scores evaluated at resolve time, at close time, and averaged over all times):
| Brier | Resolve | Close | All times |
|---|---|---|---|
| Community median | 0.121 | 0.123 | 0.153 |
| Metaculus | 0.116 | 0.116 | 0.146 |
| Difference | 4.3% | 6.0% | 4.8% |

| Log | Resolve | Close | All times |
|---|---|---|---|
| Community median | 0.42 | 0.412 | 0.274 |
| Metaculus | 0.431 | 0.431 | 0.295 |
| Difference | 2.6% | 4.6% | 7.7% |
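(For reference, the “Difference” rows here and below appear to be simple relative gaps, signed so that positive means Metaculus did better; since Brier is lower-is-better and log is higher-is-better, the denominators differ. A minimal sketch of that reading, reverse-engineered from the tables rather than documented anywhere, with the all-time numbers hard-coded:)

```python
# How the "Difference" rows appear to be computed (reverse-engineered
# from the tables; positive = Metaculus ahead).

def brier_diff(community: float, metaculus: float) -> float:
    # Brier is lower-is-better: relative gap, taken against the Metaculus score.
    return (community - metaculus) / metaculus

def log_diff(community: float, metaculus: float) -> float:
    # Log score is higher-is-better: relative gap, taken against the community score.
    return (metaculus - community) / community

# All-time numbers from the tables above: (community, Metaculus)
# at resolve time, at close time, and averaged over all times.
for comm, met in [(0.121, 0.116), (0.123, 0.116), (0.153, 0.146)]:
    print(f"Brier: {brier_diff(comm, met):+.1%}")  # +4.3%, +6.0%, +4.8%
for comm, met in [(0.420, 0.431), (0.412, 0.431), (0.274, 0.295)]:
    print(f"Log:   {log_diff(comm, met):+.1%}")    # +2.6%, +4.6%, +7.7%
```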
None of these are close to 24%. I also think that, given the Metaculus algorithm only came into existence in June 2017, we should really only look at recent performance. For example, the same tables restricted to questions from July 2018 onwards look like this:
| Brier | Resolve | Close | All times |
|---|---|---|---|
| Community median | 0.107 | 0.105 | 0.147 |
| Metaculus | 0.108 | 0.113 | 0.156 |
| Difference | -0.9% | -7.1% | -5.8% |

| Log | Resolve | Close | All times |
|---|---|---|---|
| Community median | 0.462 | 0.463 | 0.26 |
| Metaculus | 0.448 | 0.426 | 0.226 |
| Difference | -3.0% | -8.0% | -13.1% |
Now the community median outperforms every time!
For continuous questions the Metaculus forecast has outperformed more consistently out-of-sample, but the differences are still much smaller than what you’ve claimed:
| Continuous | Resolve | Close | All times |
|---|---|---|---|
| Community | 2.26 | 2.22 | 1.69 |
| Metaculus | 2.32 | 2.32 | 1.74 |
| Difference | 2.7% | 4.5% | 3.0% |

| Continuous (July ’18 onwards) | Resolve | Close | All times |
|---|---|---|---|
| Community | 2.28 | 2.27 | 1.73 |
| Metaculus | 2.35 | 2.38 | 1.79 |
| Difference | 3.1% | 4.8% | 3.5% |
I would also note that percentage difference is almost certainly the wrong metric for comparing Brier scores: a fixed absolute gap in Brier score reads as a large percentage on easy questions (where scores are small) and a small percentage on hard ones.
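A toy illustration (made-up numbers, not Metaculus data): the same absolute Brier edge produces wildly different percentage gaps depending on how hard the questions are, so a headline relative figure mostly tells you about the baseline, not about the size of the edge.

```python
# Toy example: a fixed absolute Brier edge of 0.005 reads as a very
# different "percentage difference" depending on baseline difficulty.
pairs = [
    (0.025, 0.020),  # easy questions: small Brier scores
    (0.125, 0.120),  # moderate questions
    (0.250, 0.245),  # near coin-flip questions
]
for community, metaculus in pairs:
    rel = (community - metaculus) / metaculus
    print(f"community={community:.3f}  metaculus={metaculus:.3f}  "
          f"abs gap=0.005  rel gap={rel:+.1%}")
# Prints +25.0%, +4.2%, +2.0%: same edge, wildly different percentages.
```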
Thanks for checking! I think our main difference is that you used data from the Metaculus prediction, whereas I used the Metaculus postdiction, which “uses data from all other questions to calibrate its result, even questions that resolved later.” Right now, this gives Metaculus an average log score of 0.519 vs. the community’s 0.419 on binary questions (885 questions in total), and 2.43 vs. 2.25 on 537 continuous questions, evaluated at resolve time.
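Spelling out the arithmetic: assuming the same relative-difference convention as above, these postdiction log scores do reproduce the headline figures (the continuous one presumably lines up before rounding):

```python
# Relative log-score improvement of the Metaculus postdiction over the
# community median, from the numbers quoted above.
binary = (0.519 - 0.419) / 0.419   # ~ +23.9%: the "24%" figure
continuous = (2.43 - 2.25) / 2.25  # ~ +8.0%: roughly the "9%" figure
print(f"binary: {binary:+.1%}, continuous: {continuous:+.1%}")
```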