Perhaps resolving forecasts with expert probabilities can be better than resolving them with the actual events.
The default in the literature on prediction markets and decision markets is to expect that resolutions should be real-world events instead of probabilistic estimates by experts. For instance, people would predict “What will the GDP of the US in 2025 be?”, and that would be scored against the eventually observed GDP of the US. Let’s call these empirical resolutions.
These resolutions have a few nice properties:
We can expect them to be roughly calibrated. (Somewhat obvious.)
They have relatively high precision/sharpness.
While these may be great in a perfectly efficient forecaster market, I think they may be suboptimal for incentivizing forecasters to produce the best estimates of important questions given real constraints. A more cost-effective solution could look like having a team of calibrated experts[1] inspect the situation post-event, make their best estimate of what the probability should have been pre-event, and then use that estimate to score the predictions. Let’s call these judgmental resolutions.
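As a minimal sketch of the difference, assuming a binary question and a log scoring rule (the function names here are just illustrative): an empirical resolution scores a forecast against the realized 0/1 outcome, while a judgmental resolution scores it against the judges’ post-hoc probability, i.e. takes the expected log score under that probability.

    import math

    def empirical_log_score(p_forecast: float, event_happened: bool) -> float:
        """Standard log score against the realized binary outcome."""
        return math.log(p_forecast if event_happened else 1.0 - p_forecast)

    def judgmental_log_score(p_forecast: float, q_judges: float) -> float:
        """Expected log score, using the judges' post-hoc probability q_judges
        as the resolution instead of the raw outcome."""
        return q_judges * math.log(p_forecast) + (1.0 - q_judges) * math.log(1.0 - p_forecast)

    # Under the judgmental rule, matching the judges exactly is the best you can do.
    print(judgmental_log_score(0.03, 0.03) > judgmental_log_score(0.10, 0.03))  # True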
A Thought Experiment
The intuition here can be demonstrated by a thought experiment. Say you are estimating a probability distribution X. Your prior, and the prior you expect others to have, indicates that 99.99% of X’s mass is essentially a uniform distribution, but the remaining 0.01% tail on the right is something much more complicated.
You could spend a lot of effort better estimating this 0.01% tail, but there is only a very small chance that doing so would pay off. In 99.99% of cases, any work you do here will not affect your winnings. Worse, you may need to wager a large amount of money for a long period of time for this small chance of effectively using your tiny but better-estimated tail in a bet.
Users of that forecasting system may care about this tail. They may be willing to pay for improvements in the aggregate distributional forecast, such that it better models an enlightened ideal. If it were quickly realized that 99.99% of the distribution was uniform, then any subsidies for information should go to those who did a good job improving the 0.01% tail. It’s possible that some pretty big changes to this tail could be figured out.
Say instead that you are estimating the 0.01% tail, but you know you will be scored against a probability distribution selected by experts post-result, instead of against the actual result. Say these experts get to see all previous forecasts and discussion, so in expectation they will only respond with a forecast at least as sharp as the aggregate. In this case all of your work can be focused on this tail, and the differences between forecasters’ scores may come entirely from this sliver.
This setup would require the experts[1] to be calibrated.
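Here is a rough simulation sketch of that thought experiment, with made-up numbers (a three-outcome stand-in for the distribution: the uniform body plus two tail scenarios). Under empirical resolution, the forecaster who did the tail work is only distinguished from the one who didn’t on the roughly 1-in-10,000 resolutions where the tail actually occurs; under judgmental resolution, the difference shows up in every score.

    import math, random

    random.seed(0)

    BODY, TAIL_A, TAIL_B = 0, 1, 2
    enlightened = [0.9999, 0.00008, 0.00002]   # what the expert judges would settle on
    careful     = [0.9999, 0.00008, 0.00002]   # forecaster who modeled the tail
    lazy        = [0.9999, 0.00005, 0.00005]   # forecaster who split the tail evenly

    def log_score(dist, outcome):
        return math.log(dist[outcome])

    def judgmental_score(dist, judges):
        return sum(q * math.log(p) for q, p in zip(judges, dist))

    # Empirical resolution: the two forecasts score identically unless a tail outcome occurs.
    outcomes = random.choices([BODY, TAIL_A, TAIL_B], weights=enlightened, k=100_000)
    differs = sum(log_score(careful, o) != log_score(lazy, o) for o in outcomes)
    print(f"resolutions where the tail work mattered: {differs} out of {len(outcomes)}")

    # Judgmental resolution: the careful forecaster is rewarded on every single question.
    print(judgmental_score(careful, enlightened) - judgmental_score(lazy, enlightened))  # > 0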
Further Work
I’m sure there’s a mathematical representation that would better showcase this distinction, and that would quantify the loss of motivation traders have on probabilities they know will be resolved empirically rather than judgmentally (using the empirical data in these judgments). There must be something in statistical learning theory or similar fields that deals with related problems; for instance, I imagine a classifier may be able to perform better when learning against “enlightened probabilities” instead of binary outcomes, as there is a clearer signal there.
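As a toy illustration of that last point (an estimation analogy rather than an actual classifier, with made-up numbers): recovering a rare probability from binary outcomes needs on the order of 1/p samples, while noisy but direct probability judgments give usable signal from far fewer questions.

    import random, statistics

    random.seed(1)
    p_true, n_questions = 0.0005, 1_000   # a rare event, a plausible number of resolved questions

    def estimate_from_outcomes():
        # Empirical resolutions: average of 0/1 outcomes.
        return sum(random.random() < p_true for _ in range(n_questions)) / n_questions

    def estimate_from_judgments(noise=0.2):
        # Judgmental resolutions: judges report the true probability with multiplicative noise.
        return statistics.mean(p_true * random.lognormvariate(0.0, noise) for _ in range(n_questions))

    binary = [estimate_from_outcomes() for _ in range(200)]
    judged = [estimate_from_judgments() for _ in range(200)]
    print("from binary outcomes:     ", statistics.mean(binary), "+/-", statistics.pstdev(binary))
    print("from judged probabilities:", statistics.mean(judged), "+/-", statistics.pstdev(judged))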
[1] I use “experts” here to refer to a group expected to provide highly accurate estimates, rather than to domain-respected experts.
Here is another point by @jacobjacob, which I’m copying here in order for it not to be lost in the mists of time:
Though I just realised this has some problems if you expect predictors to be better than the evaluators: e.g. they’re like “once the event happens everyone will see I was right, but up until then no one will believe me, so I’ll just lose points by predicting against the evaluators.”
Maybe in that case you could eventually also score the evaluators based on the final outcome… or kind of re-compensate people who were wronged the first time…
I’m really interested in this type of scheme because it would also solve a big problem in futarchy and futarchy-like setups that use prediction polling, namely the inability to score conditional counterfactuals (which is most of the forecasting you’ll be doing in a futarchy-like setup).
One thing you could do, instead of scoring people against expert assessments, is to score people against the final, extremized aggregate distribution.
One issue with any framework like this is that general calibration may be very different from calibration at the tails. Whatever scoring rule you’re using to assess the calibration of the experts or of the aggregate has the same issue: long-tail events rarely happen.
Another solution to this problem (although it doesn’t solve the counterfactual conditional problem) is to create tailored scoring rules that provide extra rewards for events at the tails. If an event at the tails is a million times less likely to happen, but you care about it equally to events at the center, then provide a million times the reward for accuracy near the tail in the event it happens. Prior work on tailored scoring rules for different utility functions here: https://www.evernote.com/l/AAhVczys0ddF3qbfGk_s4KLweJm0kUloG7k/
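A minimal sketch of that reweighting idea (the outcomes, weights, and base rule here are placeholders, and making such a rule strictly proper takes more care than this):

    import math

    def weighted_log_score(forecast: dict, outcome: str, weights: dict) -> float:
        """Log score scaled by an outcome-specific weight, so a rare-but-important
        outcome carries a reward comparable to the common ones when it happens."""
        return weights[outcome] * math.log(forecast[outcome])

    forecast = {"center": 0.999999, "tail": 0.000001}
    weights  = {"center": 1.0, "tail": 1_000_000.0}   # ~inverse of how often it happens
    print(weighted_log_score(forecast, "center", weights))
    print(weighted_log_score(forecast, "tail", weights))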
Good points!
Also, thanks for the link, that’s pretty neat.
I think that an efficient use of expert assessments would be for the experts to see the aggregate and then adjust it as necessary, but to try not to do much original research. I just wrote a more recent shortform post about this.
I think that we can get calibration to be as good as experts can figure out, and that could be enough to be really useful.
Another point in favor of such a set-up would be that aspiring superforecasters get much, much more information when they see roughly the prediction a superforecaster would have made with their information: a full distribution rather than just a point outcome. I’d expect that this means that market participants would get better, faster.
Yep, this way would basically be much more information-dense, with all the benefits that come from that.
You can train experts to be calibrated in different ways. If you train experts to be calibrated to pick the right probability on GPOpen, where probabilities are given in steps of 1%, I don’t think those experts will automatically be calibrated to distinguish a p=0.00004 event from a p=0.00008 event.
Experts would actually need to be calibrated on getting probabilities inside the tail right. I don’t think we know how to do calibration training for that tail.
I think this could be a good example of what I’m getting at. I think there are definitely some people in some situations who can distinguish a p=0.00004 event from a p=0.00008 event. How? By making a Fermi model or something similar.
A trivial example would be a lottery with calculable odds of success. Just because the odds are low doesn’t mean they can’t be precisely estimated.
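For the lottery case the tail probability is exactly computable. As an arbitrary worked example (a 6-of-49 draw, assuming distinct tickets), even probabilities in the 0.00004 vs 0.00008 range mentioned above can be separated cleanly:

    from math import comb

    p_ticket = 1 / comb(49, 6)               # one 6-of-49 jackpot ticket, ~7.15e-08
    print(560 * p_ticket, 1120 * p_ticket)   # ~4e-05 vs ~8e-05: a factor-of-two
                                             # difference deep in the tail, computed exactly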
I expect that the kinds of questions GPOpen would consider asking, and that are incredibly unlikely, would be difficult to estimate to within one order of magnitude. But forecasters may still be able to do a decent job, especially in cases where you can make neat Fermi models.
However, of course, it seems very silly to use the incentive mechanism “you’ll get paid once we know for sure if the event happened” on such an event. Instead, if resolutions are done with evaluators, then there is much more of a signal.
I’m fairly skeptical of this. From a conceptual perspective, we expect the tails to be dominated by unknown unknowns and black swans. Fermi estimates and other modelling tools are much better at estimating scenarios that we expect. Whereas, if we find ourselves in the extreme tails, it’s often because of events or factors that we failed to model.
I’m not sure. The reasons things happen at the tails typically fall into categories that could be organized into a small set.
For instance:
The question wasn’t understood correctly.
A significant exogenous event happened.
But, as we do a bunch of estimates, we could get empirical data about these possibilities and estimate the potential for future tails.
This is a bit different from what I was mentioning, which was more about known but small risks. For instance, the “amount of time I spend on my report next week” may be an outlier if I die. But the chance of serious accident or death can be estimated decently well. These are often repeated known knowns.
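A tiny sketch of treating such a known-but-small risk explicitly, with made-up numbers: model the quantity as a mixture of the ordinary case and the rare incapacitating one, so the far-left tail is driven by a risk that can itself be estimated from base rates.

    import random, statistics

    random.seed(2)

    P_INCAPACITATED = 0.001   # made-up weekly chance of a serious accident or illness

    def hours_on_report():
        """Hours spent on the report next week: an ordinary work distribution,
        plus a small, separately-estimable chance of doing none of it at all."""
        if random.random() < P_INCAPACITATED:
            return 0.0
        return max(0.0, random.gauss(10.0, 2.0))

    samples = [hours_on_report() for _ in range(100_000)]
    print(sum(s == 0.0 for s in samples) / len(samples))   # ~0.001, matching the known risk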
You might have people who can distinguish those, but I think it’s a mistake to speak of calibration in that sense, as the word usually refers to people who have actually trained to be calibrated via feedback.
So you don’t want predictions*, you want models**.
Robust/fully fleshed out models.
*predictions of events
**predictions of which model is correct
I’m not sure I’d say that in the context of this post, but more generally, models are really useful. Predictions that come with useful models are a lot more useful than raw predictions. I wrote this other post about a similar topic.
For this specific post, I think what we’re trying to get is the best prediction we could have had using data pre-event.