There’s a similar challenge in sports with evaluating athletes’ performance. Some pieces of what happens there:
There are many different metrics to summarize/evaluate a player’s performance rather than just one score (e.g., see all the tables here). Many are designed to evaluate a particular aspect of a player’s performance rather than how well they did overall, and there are also multiple attempts to create a single comprehensive overall rating. Over the past decade or two there have been a bunch of improvements in this, with more and better metrics, including metrics that incorporate different sources of information.
There are common features of different stats that people who follow the analytics are aware of, such as whether they’re volume stats (number of successes) or efficiency stats (number of successes per attempt). Some metrics attempt to adjust for factors that aren’t under the player’s control but which can influence the numbers, such as the quality of the opponent, the quality of the player’s teammates, the environment of the game (weather & stadium), various sources of randomness, and whether the play happened in “garbage time” (when the game was already basically decided).
Payment is based on negotiations with the people who benefit from the player’s performance (their team’s owners) rather than being directly dependent on their stats. Their stats do play into the decision, as do other things, such as close examinations of particular plays that they made.
The awards for individual performance that people care about the most (e.g., All-Star teams, MVP awards, Hall of Fame) are based on voting (by various pools of voters) rather than being directly based on the statistics. Though again, they’re influenced by the statistics and tend to line up pretty closely with the statistics.
The achievements that people care about the most (e.g., winning games & championships) are team achievements rather than individual achievements. In a typical league there might be 30 teams which each have 20 players, and there’s a mix of competitiveness between teams and cooperativeness within a team.
Seems like forecasting might benefit from copying some parts of this. For example, instead of having one leaderboard with an overall forecasting score, have several leaderboards for different ways of evaluating forecasts, along with tables where you can compare forecasters on a bunch of them at once and a page for each forecaster where you can see all their metrics, how they rank, and maybe some other stuff like links to their most upvoted comments.
Here’s a brainstorm of some possible forecasting metrics which might go in those tables (I’m probably reinventing some wheels here; I know more about existing metrics for sports than for forecasting):
Leading Indicator: get credit for a forecast if the consensus then moves in the same direction as your forecast over the next hours / days / n predictions (alternate version: only if that movement winds up being towards the true outcome)
Points Relative to Your Expectation: each forecast has an expected score according to that forecast (e.g., if the consensus is 60% and you say 80%, you think there’s a 0.8 chance you’ll gain points for doing better than the consensus and a 0.2 chance you’ll lose points for doing worse than the consensus). Report expected score alongside actual score, or report the ratio actual/expected. If that ratio is > 1, that means you’ve been underconfident or (more likely) lucky. Also, expected score functions like a “total number of forecasts” stat, weighted by the boldness of each forecast. You could also have a column for the consensus expected score (in the example: your expected score if there were only a 0.6 chance you’d gain points and a 0.4 chance you’d lose points). (This and a couple of the other metrics below are sketched in code after the list.)
Marginal Contribution to Collective Forecast: have some way of calculating the overall collective forecast on each question (which could be just a simple average, or could involve fancier stuff to try to make it more accurate including putting more weight on some people’s forecasts than others). Also calculate what the overall collective forecast would have been if you’d been absent from that question. You get credit for the size of the difference between those two numbers. (Alternative versions: you only get credit if you moved the collective forecast in the right direction, or you get negative credit if you moved it in the wrong direction.)
Trailblazer Score: use whatever forecasting accuracy metric you like (e.g. Brier score relative to consensus), but only include cases where a person’s forecast differed from the consensus at the time by at least X amount. Relevant in part because there might be different skillsets for noticing that the consensus seems off and adjusting a bit in the right direction vs. coming up with your own forecast and trusting it even if it’s not close to consensus. (And the latter skillset might be relevant if you’re making forecasts on your own without the benefit of having a platform consensus to start from.)
Market Mover: find some way to track which comments lead to people changing their forecasts. Credit those commenters based on how much they moved the market. (alternative version: only if it moved towards truth)
Pseudoprofit: find some way to transform people’s predictions into hypothetical bets against each other (or against the house), and track each person’s total profit & total amount “bet”. (I’m not sure whether this leads to different calculations or is just a different gloss on the same calculations.)
Splits: tag each question, and each forecast, with various features. Tags by topic (coronavirus, elections, technology, etc.), by what sort of event it’s about (e.g. will people accomplish a thing they’re trying to do), by amount of activity on the question, by time till event (short term vs. medium term vs. long term markets), by whether the question is binary or continuous, by whether the forecast was placed early vs. middle vs. late in the duration of the question, etc. Be able to show each scoring table only for the subset of forecasts that fit a particular tag.
Predicted Future Rating: On any metric, you can set up formulas to predict what people will score on that metric over the next (period of time / set of markets). A simple way to do that is to just predict future scores on that metric based on past scores on the same metric, with some regression towards the mean, using historical data to estimate the relationship. But there are also more complicated things using past performance on some metrics (especially less noisy ones) to help predict future performance on other metrics. And also analyses to check whether patterns in past data are mostly signal or noise (e.g. if a person appears to have improved over time, or if they have interesting splits). (Finding a way to predict future scores is a good way to come up with a comprehensive metric, since it involves finding an underlying skill from among the noise. And the analyses can also provide information about how important different metrics are, which ones to include in the big table, which ones to make more prominent.)
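To make a few of these concrete, here’s a rough sketch (illustrative assumptions of mine, not anything a platform actually computes) of how “Points Relative to Your Expectation”, the basic version of “Marginal Contribution to Collective Forecast”, and a regression-to-the-mean version of “Predicted Future Rating” might look for binary questions, using a log score relative to the consensus as the underlying accuracy metric:

```python
import math
from statistics import mean


def relative_log_score(p_user, p_consensus, outcome):
    """Points for one resolved binary forecast: log score of the user's
    probability minus log score of the consensus probability
    (positive = did better than the consensus)."""
    p_u = p_user if outcome == 1 else 1.0 - p_user
    p_c = p_consensus if outcome == 1 else 1.0 - p_consensus
    return math.log(p_u) - math.log(p_c)


def expected_relative_score(p_user, p_consensus, belief):
    """Expected points under some belief about the outcome probability.
    belief = p_user gives 'your' expectation; belief = p_consensus gives
    the 'consensus expected score' column mentioned above."""
    return (belief * relative_log_score(p_user, p_consensus, 1)
            + (1.0 - belief) * relative_log_score(p_user, p_consensus, 0))


def points_relative_to_expectation(forecasts):
    """forecasts: iterable of (p_user, p_consensus, outcome) triples.
    Returns (actual points, points expected by your own forecasts, ratio)."""
    forecasts = list(forecasts)
    actual = sum(relative_log_score(p, c, o) for p, c, o in forecasts)
    expected = sum(expected_relative_score(p, c, belief=p) for p, c, _ in forecasts)
    return actual, expected, (actual / expected if expected else float("nan"))


def marginal_contribution(question_forecasts, user):
    """Basic version of the marginal-contribution idea: the size of the change
    in a deliberately simple collective forecast (a plain mean of everyone's
    probabilities) when this user's forecast is removed."""
    with_user = mean(question_forecasts.values())
    without_user = mean(p for u, p in question_forecasts.items() if u != user)
    return abs(with_user - without_user)


def predicted_future_score(past_avg, population_avg, reliability):
    """Regression-toward-the-mean version of 'Predicted Future Rating':
    shrink a forecaster's past average toward the population average.
    reliability (between 0 and 1) would be estimated from historical data,
    e.g. the correlation between the same forecasters' scores in two
    earlier periods."""
    return population_avg + reliability * (past_avg - population_avg)


if __name__ == "__main__":
    # The worked example above: consensus 60%, you say 80%, the event happens.
    actual, expected, ratio = points_relative_to_expectation([(0.8, 0.6, 1)])
    print(actual, expected, ratio)                         # ~0.288, ~0.092, ~3.1
    print(expected_relative_score(0.8, 0.6, belief=0.6))   # consensus expectation, ~ -0.10
    print(marginal_contribution({"you": 0.8, "a": 0.6, "b": 0.55}, "you"))  # ~0.075
```

The log-score-relative-to-consensus choice is arbitrary here; any proper scoring rule would slot in the same way, and the aggregate in the marginal-contribution calculation could be something fancier than a plain mean, as described above.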
This is tractable in sports because there are millions of dollars on the line for each player. In most contexts, the costs of negotiation and of running a market for talent are too high for that approach to work well, and it’s better to use simple metrics despite all the very important problems with poorly aligned metrics. (Of course, the better solution is to design better metrics; https://mpra.ub.uni-muenchen.de/98288/ )
The full-blown process of in-depth contract negotiations, etc., is presumably beyond the scope of the current competitive forecasting arena.
One of the main things that I get out of the sports comparison is that it points to a different way of using (and thinking of) metrics. The obvious default, with forecasting, is to think of metrics as possible scoring rules, where the person with the highest score wins the prize (or appears first on the leaderboard). In that case, it’s very important to pick a good metric, one which provides good incentives. An alternative is to treat human judgment as primary, whether that means a committee using its judgment to pick which forecasters win prizes, or forecasters voting on an all-star team, or an employer trying to decide who to hire to do some forecasting for them, or just who has street cred in the forecasting community. And metrics are a way to try to help those people be more informed about forecasters’ abilities & performance, so that they’ll make better judgments. In that case, the standards for what is a good metric to include are very different. (There’s also a third use case for metrics, where the forecaster uses metrics about their own performance to try to get better at forecasting.)
Sports also provide an example of what this looks like in action, what sorts of stats exist, how they’re presented, who came up with them, what sort of work went into creating them, how they evaluate different stats and decide which ones to emphasize, etc. And it seems plausible that similar work could be done with forecasting, since much of that work was done by sports fans who are nerds rather than by the teams; forecasting has fewer fans but a higher nerd density. I did some brainstorming in another comment on some potential forecasting stats which draws a lot of inspiration from that; not sure how much of it is retreading familiar ground.
Cheers, thanks! These are great