Manifold claims a Brier score of 0.17 and says it's "very good" (https://manifold.markets/calibration).
Prediction markets in general don't score much better (https://calibration.city/accuracy). I wouldn't say 0.195 is "so bad".
Comparing Brier scores between different question sets is not meaningful (intuitive example: Manifold hurts its Brier score with every daily coinflip market and greatly improves it with every d20 die-roll market, yet both identically demonstrate zero predictive insight) [1]. You cannot call 0.195 good or bad or anything in between; a Brier score is only useful when comparing forecasters on a shared question set.
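To put numbers on that intuition (a minimal sketch of the arithmetic, not from the comment itself): if you forecast exactly the base rate p of a binary event, your expected Brier contribution is p(1−p), so an insight-free but perfectly calibrated forecast scores 0.25 on a coinflip market and about 0.05 on a d20 market.

```python
# Expected Brier score when you forecast exactly the base rate p of a binary event:
# E[(p - outcome)^2] = p*(1 - p)^2 + (1 - p)*p^2, which simplifies to p*(1 - p)
def expected_brier_at_base_rate(p: float) -> float:
    return p * (1 - p) ** 2 + (1 - p) * p ** 2

print(expected_brier_at_base_rate(0.5))     # coinflip market: 0.25
print(expected_brier_at_base_rate(1 / 20))  # d20 market: 0.0475
```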
The linked replication addresses this (it uses the same question set as the original paper); the relevant comparison is the crowd Brier score of 0.141. For intuition, the gap between the Metaculus crowd's 0.141 and the AI's 0.195 (0.054) is roughly as large as the gap between 0.195 and 0.25 (0.055), where 0.25 is what you get by guessing 50% on every question. So the claim of the replication is quite conclusive (the AI did far worse than the Metaculus crowd); the question is just whether that replication result is itself accurate.
[1]. Yes, Manifold reports this number on their website, and says it is “very good”—as a Manifold addict I would strongly encourage them to not do this. When I place bets on an event that already happened (which is super common), the Brier score contribution from that bet is near zero, i.e. impossibly good. And if I make a market that stays near 50% (also super common, e.g. if I want to maximize liquidity return), all the bets on that market push the site-wide Brier score towards the maximally non-predictive 0.25.
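Illustrative arithmetic for the footnote (my numbers, not Manifold's actual data): a 99% bet on an event that has effectively already resolved Yes contributes almost nothing to the site-wide average, while any bet on a market parked at 50% contributes the maximal 0.25 whichever way it resolves.

```python
def brier(forecast: float, outcome: int) -> float:
    # Squared error of a single probability forecast against a 0/1 resolution.
    return (forecast - outcome) ** 2

print(brier(0.99, 1))  # 0.0001 -- "impossibly good", but required no foresight
print(brier(0.50, 1))  # 0.25
print(brier(0.50, 0))  # 0.25 -- pulls the site-wide average toward 0.25 either way
```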
Slightly tangential, but do you know what the actual base rate of Manifold binary questions is? Like, is the share that resolves Yes closer to 30% or closer to 50%?