There’s no single metric or score that is going to capture everything. Metaculus points as the central platform metric were devised to — as danohu says — reward both participation and accuracy. Both are quite important. It’s easy to get a terrific Brier score by cherry-picking questions. (Pick 100 questions that you think have a 1% or 99% probability. You’ll get a few wrong, but your mean Brier score will be roughly (number wrong) × 0.01. The log score is less susceptible to this.) You can also get a fair number of points for just predicting the community prediction — but you won’t get that many, because as a question’s point value increases (which it does with the number of predictions), more and more of the score is relative rather than absolute.
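To make the cherry-picking arithmetic concrete, here is a toy calculation under an assumed 97-out-of-100 hit rate (the numbers are illustrative, not from any real track record):

```python
import math

# Toy illustration of the cherry-picking effect described above: predict 99%
# on 100 hand-picked questions and miss only a few of them.
predictions = [0.99] * 100
outcomes = [1] * 97 + [0] * 3  # assumed 97 hits, 3 misses -- illustrative only

# Brier score: mean squared error between prediction and outcome (lower is better).
brier = sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(predictions)

# Log score: mean log-probability assigned to the actual outcome (higher is better).
log_score = sum(
    math.log(p if o else 1 - p) for p, o in zip(predictions, outcomes)
) / len(predictions)

print(f"mean Brier score: {brier:.4f}")     # 0.0295 -- roughly 3 * 0.01, looks superb
print(f"mean log score:   {log_score:.4f}") # -0.1479 -- each miss costs log(0.01)
```

The Brier penalty for a confident miss is capped near 1, so a handful of misses barely dents the average; under the log score each miss costs log(0.01) ≈ −4.6, which is why it is harder to game this way.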
If you want to know how good a predictor is, points are actually pretty useful IMO, because someone who is near the top of the leaderboard is both accurate and highly experienced. Nonetheless, more ways of comparing people to each other would be useful. You can look at someone’s track record in detail, but we’re also planning to roll out more ways to compare people with each other. None of these will be perfect; there’s simply no single number that will tell you everything you might want — why would there be?
> Someone who is near the top of the leaderboard is both accurate and highly experienced
I think this unfortunately isn’t true right now: just copying the community prediction would place very highly (I’d guess that a copy made as soon as the community prediction appeared, and updated every day, would easily land in the top 3 (edit: top 10)). See my comment below for more details.
> You can look at someone’s track record in detail, but we’re also planning to roll out more ways to compare people with each other.
I’m very glad to hear this. I really enjoy Metaculus, but my main gripe with it has always been (as others have pointed out) the lack of a way to distinguish quality from quantity. I’m looking forward to a more comprehensive selection of metrics to help with this!
I actually think it’s worth tracking: ConsensusBot should be a user, it should continuously update to the community prediction as computed without it, and its own entries shouldn’t be counted as predictions, so we can see what it looks like and how it scores.
And there should be a contest to see if anyone can find a rule that looks only at the predictions themselves and does better than ConsensusBot (e.g. by deciding whose predictions to weight more heavily, or by accounting for systematic bias).
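As a hedged sketch of what one such rule might look like (the extremization factor and the toy probabilities below are assumptions for illustration, not anything Metaculus actually uses): pool the predictions in log-odds space and push the result away from 0.5, which compensates for forecasters individually under-weighting their shared evidence.

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1 - p))

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

def consensus_bot(probs: list[float]) -> float:
    """Baseline: plain arithmetic mean of the individual probabilities."""
    return sum(probs) / len(probs)

def extremized_pool(probs: list[float], a: float = 1.5) -> float:
    """Candidate rule: average in log-odds space, then extremize by factor `a`.
    (a = 1.5 is an illustrative choice; the forecast-aggregation literature
    suggests factors above 1 often help when forecasters share information.)"""
    mean_logit = sum(logit(p) for p in probs) / len(probs)
    return sigmoid(a * mean_logit)

# Toy community of five forecasters leaning Yes.
community = [0.6, 0.7, 0.65, 0.8, 0.7]
print(f"ConsensusBot:    {consensus_bot(community):.3f}")    # 0.690
print(f"Extremized pool: {extremized_pool(community):.3f}")  # ~0.774
```

Whether this actually beats the plain consensus depends on resolution data, which is exactly what the proposed contest would settle.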
> You can also get a fair number of points for just predicting the community prediction — but you won’t get that many, because as a question’s point value increases (which it does with the number of predictions), more and more of the score is relative rather than absolute.
I think this is actually backwards (the value of copying the community prediction goes up as the question’s point value increases), because the relative score is the component responsible for the “positive regardless of resolution” payoffs. Explanation and worked example here: https://blog.rossry.net/metaculus/
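The “positive regardless of resolution” effect can be seen in a stripped-down relative score — your log score minus the average log score of the other predictors (a simplification for illustration, not the exact Metaculus formula). Because log is concave, predicting the community’s mean probability beats the average predictor’s log score whichever way the question resolves:

```python
import math

# Toy community of three predictors on a binary question.
others = [0.2, 0.5, 0.8]
p_mean = sum(others) / len(others)  # copy the community mean: 0.5

def log_score(p: float, outcome: int) -> float:
    return math.log(p if outcome else 1 - p)

for outcome in (1, 0):
    mine = log_score(p_mean, outcome)
    avg_others = sum(log_score(p, outcome) for p in others) / len(others)
    print(f"resolves {'Yes' if outcome else 'No '}: relative score = {mine - avg_others:+.3f}")

# Both lines print +0.149: by Jensen's inequality, log(mean(p)) >= mean(log(p)),
# so the copier collects positive relative points no matter how the question resolves.
```

So as the relative component's weight grows with a question's point value, this guaranteed-positive payoff grows with it — which is why copying the community gets more valuable, not less.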