Metaculus points are not money, so positive points on a question don't mean you're a top predictor. They aren't meaningless either, though: to win on the leaderboards you need to win MORE points than the competition, and the incentive system is good for that (though there are some minor issues with variance-increasing strategies or questions with asymmetrical resolution timelines).
The thing that I was more surprised by, looking at the scoring system, is that Metaculus is set up as a platform for maintaining a forecast rather than as a place where you make a forecast at a particular time. (If I’m understanding the scoring correctly.)
Metaculus scores your current forecast at each moment, from the moment you first enter a forecast on the question until the moment the question closes. "Your current forecast" at each moment is simply the most recent number you entered; the only thing that happens when you enter an updated prediction is that, for the rest of the moments (until you update it again), "your current forecast" is a different number. Every moment gets equal weight, regardless of whether you last entered a number just now or three weeks ago (except that the very last moment, when the question closes, gets extra weight).
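If I have this right, the structure can be sketched in a few lines of code. This is a simplified illustration using a plain log score, not the actual Metaculus point formula, and it ignores the extra weight on the closing moment: your forecast is a step function of time, and your score is the time-average of whatever number was standing at each moment.

```python
import math

def time_averaged_log_score(updates, close_t, outcome):
    """updates: chronologically sorted (time, probability) pairs.
    Scores the piecewise-constant forecast from the first update until
    close_t, weighting every moment equally. Illustrative only; the real
    Metaculus formula also gives the final moment extra weight."""
    total = 0.0
    span = close_t - updates[0][0]
    for i, (t, p) in enumerate(updates):
        t_next = updates[i + 1][0] if i + 1 < len(updates) else close_t
        inst = math.log(p if outcome else 1.0 - p)  # instantaneous log score
        total += inst * (t_next - t)                # stale numbers keep counting
    return total / span

# A set-and-forget forecast keeps being judged long after it was entered:
print(time_averaged_log_score([(0, 0.6), (5, 0.9)], close_t=10, outcome=True))
print(time_averaged_log_score([(0, 0.6)], close_t=10, outcome=True))
```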
So it’s not like a literal betting market where you’re buying at the current market price at the moment that you make your forecast. If you don’t keep updating your forecast, then you-at-that-moment is going up against the future consensus forecast.
So the scoring system rewards the activity of entering more questions, and also the activity of updating your forecasts on each of those questions again and again to keep them up-to-date.
The problem is that metaculus points reward some non-obvious combination of making good predictions and being active on the platform. I only care about the first of those, so the current points system doesn’t help me much.
I can’t look at a user’s points score and figure out how much I should trust their predictions. Or possibly I could, but only by diving into the small print of how scoring works.
I say that as somebody who uses Metaculus and believes it has potential. The points system is definitely a weak point.
There's no single metric or score that is going to capture everything. Metaculus points, as the central platform metric, were devised to (as danohu says) reward both participation and accuracy. Both are quite important. It's easy to get a terrific Brier score by cherry-picking questions: pick 100 questions that you think have 1% or 99% probability, and you'll get a few wrong, but your mean Brier score will be ~(few)*0.01. (The log score is less susceptible to this.) You can also get a fair number of points for just predicting the community prediction, but you won't get that many, because as a question's point value increases (which it does with the number of predictions), more and more of the score is relative rather than absolute.
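The cherry-picking arithmetic is easy to check; the three-miss count below is just a number I picked for illustration, not Metaculus data:

```python
import math

n, p, misses = 100, 0.99, 3          # 100 "sure thing" questions, 3 of them go wrong

hit = (p - 1) ** 2                   # Brier contribution of a correct 99% call: 0.0001
miss = (p - 0) ** 2                  # Brier contribution of a miss: ~0.98
mean_brier = ((n - misses) * hit + misses * miss) / n
print(round(mean_brier, 4))          # ~0.0295, i.e. roughly misses * 0.01

# The log score punishes those same misses far more heavily:
mean_log = ((n - misses) * math.log(p) + misses * math.log(1 - p)) / n
print(round(mean_log, 3))            # ~ -0.148, vs. ~ -0.010 with no misses
```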
If you want to know how good a predictor is, points are actually pretty useful IMO, because someone who is near the top of the leaderboard is both accurate and highly experienced. Nonetheless, more ways of comparing people to each other would be useful. You can look at someone's track record in detail, but we're also planning to roll out more ways to compare people with each other. None of these will be perfect; there's simply no single number that will tell you everything you might want. Why would there be?
Someone who is near the top of the leaderboard is both accurate and highly experienced
I think this unfortunately isn't true right now, and just copying the community prediction would place very highly (I'm guessing that, if made as soon as the community prediction appeared and updated every day, it would easily place top 3 (edit: top 10)). See my comment below for more details.
You can look at someone's track record in detail, but we're also planning to roll out more ways to compare people with each other.
I'm very glad to hear this. I really enjoy Metaculus, but my main gripe with it has always been (as others have pointed out) the lack of a way to distinguish between quality and quantity. I'm looking forward to a more comprehensive selection of metrics to help with this!
I actually think it's worth tracking: ConsensusBot should be a user that continuously updates to whatever the public consensus prediction is in its absence, and its entries shouldn't be counted as predictions, so we can see what it looks like and how it scores.
And there should be a contest to see if anyone can use a rule that looks only at predictions, and does better than ConsensusBot (e.g. by deciding whose predictions to care about more vs. less, or accounting for systematic bias, etc).
You can also get a fair number of points for just predicting the community prediction, but you won't get that many, because as a question's point value increases (which it does with the number of predictions), more and more of the score is relative rather than absolute.
I think this is actually backwards (the value of predicting the community goes up as the question's point value increases), because the relative score is the component responsible for the "positive regardless of resolution" payoffs. Explanation and worked example here: https://blog.rossry.net/metaculus/
You don’t care, but if the goal is to motivate better communal predictions, giving people the incentive to do more predicting seems to make far more sense than having it normed to sum to zero, which would mean that in expectation you only gain points when you outperform the community.
This seems to me to be very non-obvious. Do we want more low-quality, low-effort predictions, or fewer high-quality, high-effort predictions? Do we want people to go for the exact correct probability as they see it, or to give a shove in the direction they feel strongly about? Do we want people to go around making the actual community prediction to bank free points? Who will free points motivate versus demotivate? What about the question of who to trust, and whether others would update their models based on the predictions of those who are doing well? Etc.
If I have time, a post on the subject would be interesting. Curious whether there are writings detailing how it works and the reasoning behind it, or whether you'd like to talk about it in a video call or LW meetup, or both.
The scoring system incentivizes predicting your true credence (gory details here).
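As a quick illustration of that property, using a bare log scoring rule as a stand-in for the full point formula (which is more elaborate, but built from proper scoring rules): if your true credence is 0.7, your expected score is maximized by reporting exactly 0.7, not something shaded toward the extremes or toward the community.

```python
import math

def expected_log_score(reported, true_credence):
    """Expected log score when the event occurs with probability
    true_credence and you report `reported`."""
    return (true_credence * math.log(reported)
            + (1 - true_credence) * math.log(1 - reported))

true_p = 0.7
for r in (0.5, 0.6, 0.7, 0.8, 0.9, 0.99):
    print(r, round(expected_log_score(r, true_p), 4))
# The maximum lands at reported == 0.7: honest reporting is optimal.
```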
I think Metaculus rewarding participation is one of the reasons it has participation. Metaculus can discriminate good predictors from bad predictors because it has their track records (I agree this is not the same as discriminating good predictions from bad ones). This info is incorporated in the Metaculus prediction, which is hidden by default but can be unlocked with on-site fake currency.
I think Metaculus rewarding participation is one of the reasons it has participation.
PredictionBook also had participation while being public about people's Brier scores. I think the main reason Metaculus has more activity is that it has good curated questions.
There's also no reason to only have a single public metric. Being able to achieve something like Superforecaster status on the Good Judgement Project would be valuable to motivate some people.
There was a LessWrong post about this a while back that I can't find right now, and I wrote a Twitter thread on a related topic. I'm not involved with the reasoning behind the structure of either GJP or Metaculus, so for both it's an outside perspective. However, I was recently told there is a significant amount of ongoing internal Metaculus discussion about the scoring rule, which, I now think, isn't nearly as bad as it seemed. (But even if there is a better solution, changing the rule now would have really weird impacts on the motivation of current users, which is critical to overall forecast accuracy, and I'm not sure it's worthwhile for them.)
Given all of that, I'd be happy to chat, or even do a meetup on incentives for metrics and related issues, but I'm not sure I have time to put my thoughts together more clearly in the next month. But I'd think Ozzie Gooen has even more useful things to say on the topic. (Thinking about it, I'd be really interested in being on, or watching, a panel discussion of the topic; that would probably make an interesting event.)
Having a meetup on this seems interesting. Will PM people.
https://www.lesswrong.com/posts/tyNrj2wwHSnb4tiMk/incentive-problems-with-current-forecasting-competitions ?
So one should interpret the points as a measure of how useful you've been to the overall predictions on the platform, and not of how good you should be expected to be on a specific question, right?
Not really. Overall usefulness is really about something like covariance with the overall prediction (are you contributing different ideas and models?). That would be very hard to measure, while making the points incentive compatible is not nearly as hard to do.
And how well an individual predictor will do, based on historical evidence, is found by comparing their Brier score to the Metaculus prediction's on the same set of questions. This is information that users can see on their own page. But it's not a useful figure unless you're asking about relative performance, which, as an outsider interpreting predictions, you shouldn't care about, because you want the aggregated prediction.
You could also check their track record. It has a calibration curve and much more.
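For anyone who hasn't looked at one, a calibration curve boils down to something like the following (a hand-rolled sketch, not the code Metaculus runs): bin someone's binary predictions by stated probability and compare each bin's average forecast with how often those questions actually resolved yes.

```python
import numpy as np

def calibration_curve(preds, outcomes, n_bins=10):
    """Group binary predictions into probability bins and compare the
    average stated probability with the observed frequency in each bin."""
    preds, outcomes = np.asarray(preds), np.asarray(outcomes)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (preds >= lo) & (preds < hi) if hi < 1.0 else (preds >= lo)
        if mask.any():
            rows.append((lo, hi, preds[mask].mean(),
                         outcomes[mask].mean(), int(mask.sum())))
    return rows  # (bin_lo, bin_hi, mean_forecast, observed_freq, count)

# Toy data: a well-calibrated forecaster's 60% calls should resolve yes ~60% of the time.
preds = [0.1, 0.15, 0.6, 0.62, 0.6, 0.9, 0.92, 0.88]
outcomes = [0, 0, 1, 0, 1, 1, 1, 1]
for row in calibration_curve(preds, outcomes, n_bins=5):
    print(row)
```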
So, to "win" I need to participate in every possible market, regardless of my own knowledge, as long as I can make a prediction with positive value regardless of outcome (or at least a hugely favorable spread for going with the consensus)? That sounds like a flaw.
Yes, but it doesn't take much time to just predict the community median when you don't have a clue about a question and don't want to take the time to get into it. However, as another commenter points out, this means that Metaculus is rewarding a combination of time put in + prediction skills, rather than just prediction skills.
What are you hoping to “win”? This isn’t a market—you don’t need your relative performance to be better than someone else’s to have done well. And giving people points for guessing the community prediction is valuable, since it provides evidence that they don’t have marginal information that causes them to believe something different. If people only predict when they are convinced they know significantly more than others, there would be far fewer predictions.
The wording here makes me worry we’re Goodharting on quantity of predictions. And the best way to predict the community prediction is to (of course) wait for others to predict first, then match them...
If the user is interested in getting into the top ranks, this strategy won't be anything like enough. And if not, but they want to maximize their score, the scoring system is still incentive compatible: they are better off reporting their true estimate on any given question. And for the worst (but still self-aware) predictors, this should be the Metaculus prediction anyway, so they can still come away with a positive number of points, but not many. Anything much worse than that, yes, people could have negative overall scores, which, if they've predicted on a decent number of questions, is pretty strong evidence that they really suck at forecasting.
I think this isn’t true empirically for a reasonable interpretation of top ranks. For example, I’m ranked 5th on questions that have resolved in the past 3 months due to predicting on almost every question.
Looking at my track record, for questions resolved in the last 3 months, evaluated at all times, here’s how my log score looks compared to the community:
Binary questions (N=19): me: -0.072 vs. community: -0.045
Continuous questions (N=20): me: 2.35 vs. community: 2.33
So if anything, I’ve done a bit worse than the community overall, and am in 5th by virtue of predicting on all questions. It’s likely that the predictors significantly in front of me are that far ahead in part due to having predicted on (a) questions that have resolved recently but closed before I was active and (b) a longer portion of the lifespan for questions that were open before I became active.
Edit:
I discovered that the question set changes when I evaluate at “resolve time” and filter for the past 3 months, not sure why exactly. Numbers at resolve time:
Binary questions (N=102): me: 0.598 vs. community: 0.566
Continuous questions (N=92): me: 2.95 vs. community: 2.86
I think this weakens my case substantially, though I still think a bot that just predicts the community as soon as it becomes visible and updates every day would currently be at least top 10.
Anything much worse than that, yes, people could have negative overall scores, which, if they've predicted on a decent number of questions, is pretty strong evidence that they really suck at forecasting
I agree that this could have some effect of making the site less welcoming to newcomers, but I'm curious to what extent. I have seen plenty of people with worse Brier scores than the median continuing to predict on GJO rather than being demoralized and quitting (disclaimer: survivorship bias).
I think you get more points for earlier predictions.
I think that viewing it as a competition to place highly on the leaderboards is misleading, and perhaps even damaging.
I’d think the better framing for metaculus points is that they are like money—you are being paid to predict, on net, and getting more money is better. The fact that the leaderboard has someone with a billion points, because they have been participating for years, is kind-of irrelevant, and misleading.
In fact, I'd like to see Metaculus points actually be convertible to money at some point, in some form. Yes, this would require a net cost (in dollars) to post a new question, with the pot of money divided in proportion to the total points gained on the question and negative points coming out of a user's balance. (And this would do a far better job of aligning incentives on questions than the current leaderboard system, since for a leaderboard system, proper scoring rules for points are not actually incentive compatible.)
The fact that the leaderboard has someone with a billion points, because they have been participating for years, is kind-of irrelevant, and misleading.
There are many leaderboards, including ones that only consider questions that opened recently, or tournaments with a distinct start and end date.
(And this would do a far better job of aligning incentives on questions than the current leaderboard system, since for a leaderboard system, proper scoring rules for points are not actually incentive compatible.)
This is true, but you can create leaderboards that minimize the incentive to use variance-increasing strategies (or variance-decreasing ones if you're in the lead). (Basically, just include a lot of questions so that variance-increasing strategies will most likely backfire, and then have gradually increasing payouts for better rankings.)
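Here's a toy Monte Carlo of that argument. The noise model, the "extremizing" strategy, and the payout schedule are all invented for illustration (nothing here is Metaculus's actual setup); it just shows that once payouts decline gradually with rank and the question count is large, a variance-increasing player's average prize shrinks.

```python
import math
import random

def extremizer_prize(n_questions, n_players=20, extremize=2.5, trials=500):
    """Average prize won by one variance-increasing player against an
    otherwise identical honest field, under a toy scoring/payout model."""
    payouts = [10, 6, 3, 1] + [0] * (n_players - 4)   # gradually decreasing prizes
    total = 0.0
    for _ in range(trials):
        scores = [0.0] * n_players                    # player 0 is the extremizer
        for _ in range(n_questions):
            p_true = random.uniform(0.05, 0.95)
            outcome = random.random() < p_true
            for i in range(n_players):
                belief = min(0.99, max(0.01, p_true + random.gauss(0, 0.1)))
                if i == 0:                            # push the report toward 0 or 1
                    odds = (belief / (1 - belief)) ** extremize
                    belief = odds / (1 + odds)
                scores[i] += math.log(belief if outcome else 1 - belief)
        rank = sum(s > scores[0] for s in scores)     # 0 = best total score
        total += payouts[rank]
    return total / trials

for q in (10, 50, 200):
    print(q, "questions -> extremizer's average prize:", round(extremizer_prize(q), 2))
# With few questions the gamble occasionally tops the board; with many, the
# expected per-question loss dominates and the strategy most likely backfires.
```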
I agree that what you describe sounds ideal, and maybe it makes sense for Metaculists to think of the points in that way. As for making it a reality, I worry that it would cost a lot. (And you'd need a solution to the problem that anyone who wants a few extra dollars could create an account and predict the community median on every question, just to get some fraction of the total prize pool.)
If points could be converted to enough money to motivate real predictions, I would expect a flood of people who do nothing but follow the information cascade to bank points, and it's not obvious what to do about that. As it is, it felt (to me) like there was a tension between 'score points' and 'make good predictions, or at least don't make noise predictions', and that felt like a dealbreaker.
I agree that actually offering money would require incentives to avoid, essentially, Sybil attacks. But making sure people don't make "noise predictions" isn't a useful goal: those noise predictions don't really affect the overall Metaculus prediction much, since it weights by past accuracy.
Metaculus's incentive system is such that the more predictions you make, the more points you will get. Even if you know nothing about a question, you are still incentivised to predict on it.