Maybe “sidestep the data leakage issue” then. The series was designed with these issues in mind. (I work at Metaculus.)
ChristianWilliams
Hi @Odd anon, thanks for the feedback and questions.
1. To your point about copying the Community Prediction: It’s true that if you copied the CP at all times, you would indeed receive a high Baseline Accuracy score. The CP is generally a great forecast! CP hidden periods do mitigate this issue somewhat, and we are monitoring user behavior on this front and will address it if it becomes a problem. Our scoring trade-offs doc also lists further ways to address CP copying, e.g.:

“We could have a leaderboard that only considers the last prediction made before the hidden period ends to calculate Peer scores. This largely achieves the goal above: it rewards judgement, and it does not require constant updating or tracking the news. It does not reward finding stale questions.”
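To make that concrete, here’s a rough Python sketch of how that variant could work on a binary question. The function names, the natural-log scoring rule, and the ×100 scaling below are illustrative placeholders for the example, not our actual scoring implementation — see the scoring docs for the exact Peer score definition.

```python
from math import log

def binary_log_score(p_yes: float, outcome: bool) -> float:
    """Log score of a binary forecast: ln of the probability given to the realized outcome."""
    return log(p_yes) if outcome else log(1.0 - p_yes)

def peer_scores_last_before_hidden(forecasts, hidden_period_end, outcome):
    """Peer scores computed only from each user's last forecast before the hidden period ends.

    `forecasts` is a list of (user, timestamp, p_yes) tuples, with timestamps comparable
    to `hidden_period_end`. Returns a dict mapping user -> peer score.
    """
    # Keep only each user's latest forecast that precedes the end of the hidden period.
    last = {}
    for user, ts, p in forecasts:
        if ts < hidden_period_end and (user not in last or ts > last[user][0]):
            last[user] = (ts, p)

    log_scores = {user: binary_log_score(p, outcome) for user, (_, p) in last.items()}
    n = len(log_scores)
    if n < 2:
        return {user: 0.0 for user in log_scores}

    total = sum(log_scores.values())
    # Peer score, roughly: 100 x the average difference between a user's log score
    # and the log scores of every other user on the question.
    return {
        user: 100.0 * (s - (total - s) / (n - 1))
        for user, s in log_scores.items()
    }
```

Under a metric like this, only forecasts made before the hidden period ends count, so copying the CP once it’s revealed gains you nothing.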
Have a look here, and let us know what you think! (We also have some ideas we’re tinkering with that are not listed in that doc, like accuracy metrics that exclude forecasts sitting on the CP, or within +/- some delta of it.)
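A toy version of that CP-distance filter might look like the snippet below — the `cp_at` lookup and the 0.03 delta are just placeholders for the example:

```python
def forecasts_away_from_cp(forecasts, cp_at, delta=0.03):
    """Keep only forecasts that differ from the Community Prediction by more than `delta`.

    `forecasts` is a list of (user, timestamp, p_yes); `cp_at(timestamp)` returns the CP
    probability at that moment. Only the surviving forecasts would feed the accuracy metric.
    """
    return [
        (user, ts, p)
        for user, ts, p in forecasts
        if abs(p - cp_at(ts)) > delta
    ]
```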
2. On indicating confidence: You’ll see in the trade-offs doc that we’re also considering letting users exclude a particular forecast from their Peer score (Idea #3), which could partly address this. (Interestingly, indicating confidence was tried at the Good Judgment Project, but it ultimately didn’t work and was abandoned.)

We’re continuing to develop ideas on the above, and we’d definitely welcome further feedback!
Hi @gwern, we are currently combing through the winners’ documentation of their bots and the models they used. We haven’t yet encountered anyone who claims to have used one of the base models.
We’ll share an update here if we learn that a participant did use one.