One idea I’m excited about is that predictions can be made about prediction accuracy itself. This seems pretty useful to me.
Example
Say there’s a forecaster, Sophia, who’s making a bunch of predictions for pay. She also makes a meta-prediction of her total prediction score under a log-loss scoring rule (over all of her predictions except the meta-predictions themselves). She says that she’s 90% sure that her total loss score will be between −5 and −12.
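To make the scoring concrete, here’s a minimal sketch with made-up numbers. It assumes the total score is just the sum of per-question log scores over binary questions, and that Sophia derives her 90% interval by simulating outcomes from her own stated probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sophia's stated probabilities for 20 hypothetical binary questions.
probs = rng.uniform(0.55, 0.95, size=20)

def total_log_score(probs, outcomes):
    """Sum of per-question log scores: log(p) if the event happened, log(1 - p) otherwise."""
    picked = np.where(outcomes == 1, probs, 1 - probs)
    return np.log(picked).sum()

# Meta-prediction: if Sophia trusts her own probabilities, she can simulate
# outcomes from them and read off a 90% interval for her eventual total score.
simulated_outcomes = rng.binomial(1, probs, size=(10_000, len(probs)))
simulated_scores = np.array([total_log_score(probs, o) for o in simulated_outcomes])
low, high = np.percentile(simulated_scores, [5, 95])
print(f"90% interval for the total log score: [{low:.1f}, {high:.1f}]")
```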
The problem is that you probably have no good reason to trust Sophia’s meta-prediction unless she has a lot of experience making similar forecasts.
This is somewhat solved if you have a forecaster you trust who can make a prediction based on Sophia’s apparent ability and honesty. The naive approach would be for that forecaster to predict their own distribution of Sophia’s log-loss, but there’s perhaps a simpler solution. If Sophia’s provided loss distribution is correct, that would mean she’s calibrated in this dimension (basically, this is very similar to general forecast calibration). So instead of forecasting a whole distribution of their own, the trusted forecaster could forecast an adjustment to Sophia’s stated distribution. Generally this adjustment would be in the direction of adding expected loss, since Sophia probably has more of an incentive to be overconfident (i.e., to report an expected score that’s more favorable than warranted) than underconfident. The adjustment could perhaps take the form of a percentage modifier (e.g., −30%), a mean shift (e.g., −3 to −8 points), or something else.
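As a toy illustration of those two adjustment formats (the exact conventions and numbers here are my own assumptions, since the post leaves them loose):

```python
import numpy as np

# Hypothetical samples from Sophia's stated distribution of her total log score
# (all negative; more negative = worse, matching the post's convention).
stated_scores = np.random.default_rng(1).normal(loc=-8.5, scale=2.0, size=10_000)

def percentage_adjustment(scores, pct_more_loss):
    """Scale every sampled score toward more loss, e.g. 0.30 adds 30% more expected loss."""
    return scores * (1 + pct_more_loss)

def mean_adjustment(scores, shift):
    """Shift the whole distribution by a fixed number of points (negative = more loss)."""
    return scores + shift

print(stated_scores.mean())                               # Sophia's stated mean, ~ -8.5
print(percentage_adjustment(stated_scores, 0.30).mean())  # "-30%"-style modifier, ~ -11
print(mean_adjustment(stated_scores, -5.0).mean())        # fixed-shift modifier, ~ -13.5
```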
External clients would probably learn not to trust Sophia’s provided expected error directly, but instead the “adjusted” forecast.
This can be quite useful. Now, if Sophia wants to try to “cheat the system” and claim that she’s found new data that decreases her estimated error, the trusted forecaster will pay attention and modify their adjustment accordingly. Sophia will then need to provide solid evidence that she really believes her work and is really calibrated for the trusted forecaster to budge.
I want to call this something like forecast appraisal, attestation, or pinning. Please leave comments if you have ideas.
“Trusted Forecaster” Error
You may be wondering how we ensure that the “trusted” forecaster is actually good. For one thing, they would hopefully go through the same procedure. I imagine there could be a network of “trusted” forecasters who all estimate each other’s predicted “calibration adjustment factors.” This wouldn’t work if observers didn’t trust any of them, or thought they were colluding, but it could work if observers trusted even one predictor in the network. Also, note that over time data would come in and some of this would be verified empirically.
The idea of focusing heavily on “expected loss” seems quite interesting to me. One thing it could encourage is contracts or Service Level Agreements. For instance, I could propose a 50/50 bet to anyone, pegged to a percentile of my expected loss distribution. Like, “I’d be willing to bet $1,000 with anyone that the eventual total error of my forecasts will be less than the 65th percentile of my specified predicted error.” Or, perhaps a “prediction provider” would have to pay back part of their fee, or even more, if the realized error lands at a high percentile of their stated error distribution. This could generally be a good way to verify a set of forecasts. Another example would be to have a prediction group make 1,000 forecasts, then heavily subsidize one question on a popular prediction market that predicts their total error.
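Here’s a minimal sketch of how such a bet could settle. It assumes the stated error distribution is provided as samples and that error is positive with lower being better; those conventions are my own for this sketch:

```python
import numpy as np

def settle_bet(stated_loss_samples, realized_loss, percentile=65, stake=1_000):
    """Settle the 50/50 bet described above.

    Convention assumed here: 'loss' is positive and lower is better.
    The forecaster wins the stake if their realized total loss comes in
    below the chosen percentile of their own stated loss distribution."""
    threshold = np.percentile(stated_loss_samples, percentile)
    return stake if realized_loss < threshold else -stake

# Hypothetical example: stated total-loss distribution and two possible outcomes.
rng = np.random.default_rng(2)
stated = rng.normal(loc=9.0, scale=2.0, size=10_000)
print(settle_bet(stated, realized_loss=8.2))   # +1000: came in under the 65th percentile
print(settle_bet(stated, realized_loss=12.5))  # -1000: came in above it
```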
Markets For Purchasing Prediction Bundles
Of course, trusted forecasters can forecast “calibration adjustment factors” not only for ongoing forecasts, but also for hypothetical ones.
Say you have 500 questions that need to be predicted, and there are multiple agencies that all say they could do a great job predicting these questions. They all give estimates of their mean predicted error, conditional on them doing the prediction work. Then you have a trusted forecaster give a calibration adjustment.
| | Firm’s Predicted Error | Calibration Adjustment | Adjusted Predicted Error |
|---|---|---|---|
| Firm 1 | −20 | −2 | −22 |
| Firm 2 | −12 | −9 | −21 |
| Firm 3 | −15 | −3 | −18 |
(Note: the lower the expected error, the worse)
In this case, Firm 2 makes the best claim, but is judged to be significantly overconfident. Firm 3 has the best adjusted predicted error, so they’re the ones to go with. In fact, you may want to penalize Firm 2 further for being a so-called prediction service with apparently poor calibration.
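A trivial sketch of the selection rule implied by the table, with the numbers copied from it (any further penalty for poor calibration is left out):

```python
firms = {
    "Firm 1": {"predicted": -20, "adjustment": -2},
    "Firm 2": {"predicted": -12, "adjustment": -9},
    "Firm 3": {"predicted": -15, "adjustment": -3},
}

# Adjusted predicted error = firm's own estimate plus the trusted forecaster's adjustment.
adjusted = {name: f["predicted"] + f["adjustment"] for name, f in firms.items()}

# Lower (more negative) is worse, so pick the firm with the highest adjusted value.
best = max(adjusted, key=adjusted.get)
print(adjusted)  # {'Firm 1': -22, 'Firm 2': -21, 'Firm 3': -18}
print(best)      # Firm 3
```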
Correlations
One quick gotcha: one can’t simply combine the predicted errors of individual forecasts as if they were independent to get the distribution of total predicted error, because there are likely to be many correlations between them. For example, if things go “seriously wrong,” it’s likely that many different predictions will have high losses at once. Handling this perfectly would really require one model to have produced all the forecasts, but if that’s not the case, there are likely simple ways to approximate it.
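Here’s a small Monte Carlo sketch of why this matters, with entirely hypothetical numbers: when per-question losses share a common factor, the spread of the total loss is much wider than the independence assumption suggests, even though the mean is the same.

```python
import numpy as np

rng = np.random.default_rng(3)
n_questions, n_sims, rho = 100, 20_000, 0.5

# Per-question losses driven by a shared "things went seriously wrong" factor
# plus independent noise, versus a purely independent baseline.
shared = rng.normal(size=(n_sims, 1))
idio = rng.normal(size=(n_sims, n_questions))
z = np.sqrt(rho) * shared + np.sqrt(1 - rho) * idio
losses_corr = np.exp(0.5 * z)                                          # correlated losses
losses_indep = np.exp(0.5 * rng.normal(size=(n_sims, n_questions)))    # independent losses

for name, losses in [("correlated", losses_corr), ("independent", losses_indep)]:
    total = losses.sum(axis=1)
    print(name, "mean:", round(total.mean(), 1),
          "5th-95th percentile:", np.round(np.percentile(total, [5, 95]), 1))
```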
Bundles vs. Prediction Markets
I’d expect that in many cases, private services will be more cost-effective than posting predictions to full prediction markets. Private services could also offer more privacy and customization. The general selection strategy in the table above could of course include options that involve hosting questions on prediction markets, with the winner chosen based on the adjusted estimates.
“I’d be willing to bet $1,000 with anyone that the eventual total error of my forecasts will be less than the 65th percentile of my specified predicted error.”
I think this is equivalent to applying a non-linear transformation to your proper scoring rule. When things settle, you get paid based both on the outcome of your object-level prediction p and on your meta-prediction q of its score S(p).
Hence:
S(p) + B(q(S(p)))
where B is the “betting scoring function”.
This means getting the scoring rules to work while preserving properness will be tricky (though not necessarily impossible).
One mechanism that might help: each player makes one object-level prediction p and one meta-prediction q, but at resolution you randomly sample one and only one of the two to actually pay out.
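Here’s a sketch of that randomized-resolution mechanism under some assumed details (a single binary question, a log score for the object-level prediction, and a simple threshold bet for the meta-prediction). It just implements the mechanics; it isn’t a claim that this preserves properness.

```python
import math
import random

def log_score(p, outcome):
    """Log score of a binary prediction p given outcome in {0, 1}."""
    return math.log(p if outcome == 1 else 1 - p)

def bet_payout(meta_threshold, realized_score, stake=1.0):
    """Toy meta-bet: win the stake if the realized score beats the stated threshold."""
    return stake if realized_score >= meta_threshold else -stake

def resolve(p, meta_threshold, outcome):
    """Randomly pay out exactly one of the two predictions."""
    realized = log_score(p, outcome)
    if random.random() < 0.5:
        return ("object", realized)                          # pay the object-level log score
    return ("meta", bet_payout(meta_threshold, realized))    # pay the meta-bet only

print(resolve(p=0.8, meta_threshold=-0.5, outcome=1))
```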
Interesting, thanks! Yea, agreed it’s not proper. Coming up with interesting payment / betting structures for “package-of-forecast” combinations seems pretty great to me.
I think this paper might be relevant: https://users.cs.duke.edu/~conitzer/predictionWINE09.pdf
Abstract. A potential downside of prediction markets is that they may incentivize agents to take undesirable actions in the real world. For example, a prediction market for whether a terrorist attack will happen may incentivize terrorism, and an in-house prediction market for whether a product will be successfully released may incentivize sabotage. In this paper, we study principal-aligned prediction mechanisms: mechanisms that do not incentivize undesirable actions. We characterize all principal-aligned proper scoring rules, and we show an “overpayment” result, which roughly states that with n agents, any prediction mechanism that is principal-aligned will, in the worst case, require the principal to pay Θ(n) times as much as a mechanism that is not. We extend our model to allow uncertainties about the principal’s utility and restrictions on agents’ actions, showing a richer characterization and a similar “overpayment” result.
> This is somewhat solved if you have a forecaster you trust who can make a prediction based on Sophia’s apparent ability and honesty. The naive approach would be for that forecaster to predict their own distribution of Sophia’s log-loss, but there’s perhaps a simpler solution. If Sophia’s provided loss distribution is correct, that would mean she’s calibrated in this dimension (basically, this is very similar to general forecast calibration). So instead of forecasting a whole distribution of their own, the trusted forecaster could forecast an adjustment to Sophia’s stated distribution. Generally this adjustment would be in the direction of adding expected loss, since Sophia probably has more of an incentive to be overconfident than underconfident. The adjustment could perhaps take the form of a percentage modifier (e.g., −30%), a mean shift (e.g., −3 to −8 points), or something else.
Is it actually true that forecasters would find it easier to forecast the adjustment?
One nice thing about adjustments is that they can be applied to many forecasts. Like, I can estimate the adjustment for someone’s [list of 500 forecasts] without having to look at each one.
Over time, I assume that there would be heuristics for adjustments, like, “Oh, people of this reference class typically get a +20% adjustment”, similar to margins of error in engineering.
That said, these are my assumptions, I’m not sure what forecasters will find to be the best in practice.