Resolutions to the Challenge of Resolving Forecasts
One of the biggest challenges in forecasting is formulating clear and resolvable questions, where the resolution reflects the intent of the question. When this doesn't happen, there is often uncertainty about how the question will be resolved, leading to uncertainty about what to predict. I want to discuss this problem and, in this post, survey a variety of methods that are useful for resolving predictions.
But first, the problem.
What is the Problem?
The OpenPhil / Good Judgment COVID-19 dashboard provides an example. The goal was to predict the number of cases and deaths due to COVID. The text of the questions was “How many X will be reported/estimated as of 31 March 2021?” and the explanation clarified that “The outcome will be determined based on reporting and estimates provided by Johns Hopkins of X...”
Early on, the question was fairly clear: it was about what would happen. As time went on, however, it became clear that because reporting was based on very limited testing, there would be a significant gap between the reported totals and the estimated totals. Discussions of what total to predict were then partly side-tracked by predictions of whether the reports or the estimates would be used, with a very large gap between the two.
Solving the Problem?
This is a very general problem for forecasting, and various paths towards solutions have been proposed. One key desideratum, however, is clear: whatever resolution criteria are used, they should be explicitly understood beforehand. Which approach to commit to beforehand, however, is still up for debate, and I'll present several approaches in this post.
Be Inflexible
We can be inflexible with resolution criteria, and always specify exactly what number or fact will be used for the resolution, and never change that. To return to the example above, the COVID prediction could have been limited to, say, the final number displayed on the Johns Hopkins dashboard by the resolution date. If the dashboard is discontinued, the question would use the final number displayed before then, and if it is modified to no longer display a single number, say, providing a range instead, it would use the final number before the change was introduced.
Of course, this means that even the smallest deviation from what you expected or planned for will lead to the question resolving in a way other than representing the outcome in question. Worse, the prediction is now in large part about whether something will trigger the criteria for effectively ending the question early. That means that the prediction is less ambiguous, but also less useful.
Eliminate Ambiguity
An alternative is to try to specify what happens in every case. If a range is presented, or alternative figures are available on an updated dashboard, the highest estimate or figure will be used. If the dashboard is discontinued, the people running it will be asked to provide a final number to resolve the question. If they do not reply, or do not agree on a specific value, a projection of the totals based on a linear regression using the final month of data will be used. This type of resolution requires specifying every possible eventuality, which is sometimes infeasible. It also needs to fall back on some final simple criterion to cover edge cases, and it needs to do so unambiguously.
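As a rough sketch of what that final fallback might look like in practice (the function and the choice of daily totals are illustrative assumptions, not part of any actual platform's rules):

```python
import numpy as np

def fallback_projection(daily_totals, days_until_resolution):
    """Project a cumulative total forward to the resolution date using a
    linear regression over the final month of available data."""
    last_month = np.asarray(daily_totals[-30:], dtype=float)
    days = np.arange(len(last_month))
    slope, intercept = np.polyfit(days, last_month, 1)
    projected_day = len(last_month) - 1 + days_until_resolution
    return slope * projected_day + intercept

# e.g., if the dashboard is discontinued 10 days before the resolution date:
# resolved_value = fallback_projection(reported_totals, days_until_resolution=10)
```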
Walk Away
As another alternative, Metaculus sometimes chooses to leave a question as “ambiguous” if the data source is discontinued, or if it is later discovered that for other reasons the resolution as stated doesn't work (for example, a possibility other than those listed occurs). That is undesirable because forecasters cannot get feedback, no awards are given, forecasters feel like they have wasted their time, and the question that the prediction was supposed to answer ends up giving no information.
Predict Ambiguity
Augur, and perhaps other prediction markets, also allow for one of the resolutions to be “ambiguous” (or, “Invalid Market”, source). For example, a question on who was the president of Venezuela in 2020 might have been resolved as “Invalid” given that both Juan Guaidó and Nicolás Maduro had a claim to the position. Crucially, on Augur “ambiguous” resolutions can themselves be traded on (and thus predicted); this creates a better incentive than walking away, but in cases where there is a “morally correct” answer, it falls short of ideal.
Resolve with a Probability
One way to make the resolution less problematic when the outcome is ambiguous is to resolve probabilistically, or something similar. In such a case, instead of a yes or no question resolving with a binary yes or no, a question can resolve with a probability, with a confidence interval, or with a distribution. This is the approach taken by Polymarket (example, for binary questions) or foretold (for continuous questions). We can imagine this as a useful solution if a baseball game is rained out. In such cases, perhaps the rules would be to pick a probability based on the resolution of past games: with the teams tied, it resolves at 50%, and with one team up by 3 runs in the 7th inning, it resolves at the percentage of past games won by teams that were up by 3 runs at that point in the game.
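To make this concrete, here is a minimal sketch of how forecasts might be scored against a probabilistic resolution; the 70% win rate for a team up by 3 runs in the 7th is an invented placeholder, and this Brier-style rule is one possible choice rather than any particular platform's method.

```python
def brier_vs_probabilistic_resolution(forecast_p, resolution_p):
    """Expected Brier score of a forecast when a binary question resolves at
    probability `resolution_p` instead of a hard 0 or 1; this is the expected
    squared error if the outcome were drawn as Bernoulli(resolution_p)."""
    return (forecast_p - resolution_p) ** 2 + resolution_p * (1 - resolution_p)

# Rained-out game, home team up by 3 runs in the 7th; suppose such teams have
# historically won ~70% of the time, so the question resolves at 0.70:
print(brier_vs_probabilistic_resolution(0.80, 0.70))  # ~0.22 for a forecast of 80%
print(brier_vs_probabilistic_resolution(0.50, 0.70))  # ~0.25 for a forecast of 50%
```

A forecaster whose probability was closer to the implied 70% gets a better (lower) score, which is the behavior we want from a probabilistic resolution.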
Aside: Ambiguity can be good!
As the last two resolution methods indicate, eliminating ambiguity also greatly reduces the usefulness of a question. An example of this was a question in the original Good Judgment competition, which asked “Will there be a violent confrontation between China and a neighbor in the South China Sea?” where the resolution criterion was whether there was a fatal interaction between the different countries. The predictions were intended to be about whether military confrontations would occur, but the resolution ended up being about a Chinese fisherman stabbing a South Korean coast guard officer.
Barb Mellers said that the resolution “just reflected the fact that life is very difficult to predict.” I would disagree: I claim that the resolution reflected a failure to align the question with its intent, which was predicting increased Chinese aggression. But some mismatch is inevitable when questions are concrete and the metric used is an imperfect proxy. (Don't worry, I'm not going to talk about Goodhart's Law yet again. But it's relevant.)
This is why we might prefer a solution that allows some ambiguity, or at least interpretation, without depending on ambiguous or overly literal resolutions. I know of two such approaches.
Offloading Resolution
One approach for dealing with ambiguous resolutions that still resolves predictions unambiguously is to appeal to an outside authority.
A recent Metaculus question for the 20/20 Insight Prediction Contest asked about a “Democratic majority in the US Senate.” When the Senate ended up tied, with Democratic control resting on the vice president's tie-breaking vote, the text of the question was cited: it said “The question resolves positively if Democrats hold 51 seats or more in the Senate according to the official election results.” Since the vice president votes but does not hold a seat, the technical criteria were not met, despite the result being understood informally as a “Democratic majority.” The Metaculus admins said that they agreed on the question resolution because of the “51 seats or more” language.
Instead of relying narrowly on the wording, however, the contest rules were that in case of any ambiguity, the contest administrators “will consult at least three independent individuals, blind to our hypotheses and to the identity of participants, to make a judgment call in these contested cases.” The question resolution didn’t change, but the process was based on outside advisors.
Meta-Resolution
A second, more extreme approach is to ask forecasters to predict what they think the experts will decide. Instead of predicting a narrow and well-specified outcome, this allows for predicting things that are hard to pin down at present.
This is the approach proposed by Jacob Lagerros and Ben Gold for an AI Forecasting Resolution Council, where they propose using a group of experts to resolve otherwise likely-to-become-ambiguous questions. Another example of this is Kleros, a decentralized dispute resolution service. To use it, forecasts could have the provision that they be submitted to Kleros if the resolution is unclear, or perhaps all cases would be resolved that way.
This potentially increases fidelity with the intent of a question, but it has costs. First, there are serious disadvantages to the ambiguity, since forecasters are now predicting a meta-level outcome. Second, there are both direct and management costs to having experts weigh in on predictions. And lastly, this doesn't actually avoid the problem of how to resolve the question; it offloads it, albeit in a way that can decrease the costs of figuring out how to decide.
As an interesting application of a similar approach, meta-forecasts have also been proposed as a way to resolve very long-term questions. In this setup, we can ask forecasters to predict what a future forecast will be. Instead of predicting the price of gold in 2100, they can predict what another market will predict in 2030, and perhaps that market can itself be predicting a market in 2040, and so on. But this strays somewhat from this post's purpose, since the eventual resolution is still clear.
Conclusion
In this post, I've tried to outline the variety of methods that exist for resolving forecasts. I think this is useful as a reference and starting point for thinking about how to create and resolve forecasts. I also think it's useful for framing a different problem that I want to discuss in the next post, about the difference between ambiguity and flexibility, and how to allow flexibility without making resolutions ambiguous.
Thanks
Thanks to Ozzie Gooen for inspiring the post. Thanks also to Edo Arad, Nuño Sempere, and again, Ozzie, for helpful comments and suggestions.
Some ideas:
IIRC you wrote a previous post about Goodhart which included the idea that keeping evaluation methods secret can help avert Goodhart. The same idea seems relevant here: if the exact resolution method is not known to forecasters, it’s more worthwhile for them to put effort into better forecasts on the intended topic, rather than forecasting the unintended corner cases.
For cases like the COVID numbers, partial resolution seems like it would be useful: you can't get the exact numbers, but you can get relatively firm lower and upper bounds. A prediction market could partially pay out bets; in forecasting more generally, it should be possible to partially score. The numerical estimates might be improved over time, so e.g. you get an initial partial payout when the first statistics become available, and refined estimates gradually close the confidence intervals.
For offloading resolution / meta-resolution, one helpful mechanism might be to pay experts based on agreement with the majority of experts. This could be used in cases where there is low trust but low risk of collusion, so that the Schelling point for those involved is to give the common-sense judgement.
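A minimal sketch of that payment rule, assuming binary judgments, a flat base fee, and a bonus for siding with the majority (all of the parameter values here are made up for illustration):

```python
from collections import Counter

def pay_by_majority_agreement(judgments, base_fee=100.0, agreement_bonus=50.0):
    """Pay each expert a base fee, plus a bonus if their judgment matches the
    majority judgment. `judgments` maps expert name -> "yes" or "no"."""
    majority, _ = Counter(judgments.values()).most_common(1)[0]
    return {expert: base_fee + (agreement_bonus if j == majority else 0.0)
            for expert, j in judgments.items()}

# e.g. three independent experts asked to resolve a contested question:
print(pay_by_majority_agreement({"A": "yes", "B": "yes", "C": "no"}))
# {'A': 150.0, 'B': 150.0, 'C': 100.0}
```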
Partial resolution could also help with getting some partial signal on long-term forecasts.
In particular, if we know that a forecasting target is growing monotonically over time (like “date at which X happens” or “cumulative number of X before a specified date”), we can split P(outcome=T) into P(outcome>lower bound)*P(outcome=T|outcome>lower bound). If we use log scoring, we then get log(P(outcome>lower bound)) as an upper bound on the score.
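Written out (with T the eventual outcome and L a lower bound that has already been confirmed):

```latex
P(\text{outcome} = T) = P(\text{outcome} > L) \cdot P(\text{outcome} = T \mid \text{outcome} > L)

\log P(\text{outcome} = T)
  = \log P(\text{outcome} > L) + \log P(\text{outcome} = T \mid \text{outcome} > L)
  \le \log P(\text{outcome} > L)
```

The inequality holds because the second term is the log of a probability, and so is at most zero; confirming the lower bound therefore already pins down an upper bound on the eventual log score.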
If forecasts came in the form of more detailed models, it should be possible to use a similar approach to calculate bounds based on conditioning on more complicated events as well.
“partial resolution seems like it would be useful”
I hadn't thought of this originally, but Nuno added the category of “Resolve with a Probability,” which does this. The idea of iterated closing of a question as the bounds improve is neat, but probably technically challenging. (GJ Inc. kind of does this when they close answer options that are already certain to be wrong, such as total ranges below the current number of COVID cases.) I'd also worry it creates complexity that makes it much less clear to forecasters how things will work.
“one helpful mechanism might be to pay experts based on agreement with the majority of experts”
Yes, this has been proposed under the same set of ideas as “meta-forecasts have also been proposed as a way to resolve very long-term questions,” though I guess it has clearer implications for otherwise ambiguous short-term questions. I should probably include it. The key problem in my mind, which isn't necessarily fatal, is that it makes incentive compatibility into a fairly complex game-theoretic issue, with collusion and similar problems being possible.
“keeping evaluation methods secret can help avert Goodhart”
Yes, I’ve definitely speculated along those lines. But for the post, I was worried that once I started talking about this as a Goodhart-issue, I would need to explain far more, and be very side-tracked, and it’s something I will address more in the next post in any case.
Here’s how I imagine it working.
Suppose a prediction market includes a numerically valued proposition: say we forecast COVID numbers not by putting probabilities on different ranges, but by letting people buy and sell contracts which pay out proportional to the COVID numbers. The market price of such a contract becomes our projection. (Or, you know, some equivalent mechanism for non-cash markets.)
Then, when we get partial information about COVID numbers, we create a partial payout: if we're confident COVID numbers for a given period were at least 1K, we can cause sellers of the contract to pay 1K's worth to buyers. As the lower bound gets better, they pay more.
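A minimal sketch of that bookkeeping, assuming confirmed lower bounds arrive over time and each contract pays some fixed amount per case (the rate and the function name are invented for illustration):

```python
def incremental_payouts(confirmed_lower_bounds, payout_per_case=0.001):
    """Given successively confirmed lower bounds on the final number, return the
    extra amount (per contract) that sellers owe buyers at each step: they pay
    only for the newly confirmed portion."""
    payouts = []
    already_paid_for = 0
    for bound in confirmed_lower_bounds:
        newly_confirmed = max(0, bound - already_paid_for)
        payouts.append(newly_confirmed * payout_per_case)
        already_paid_for = max(already_paid_for, bound)
    return payouts

# e.g. lower bounds of 1,000 then 4,000 then 4,500 cases get confirmed over time:
print(incremental_payouts([1_000, 4_000, 4_500]))  # [1.0, 3.0, 0.5]
```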
Of course, the mathematical work deciding when we can be “confident” of a given lower bound can be challenging, and the forecasters have to guess how this will be handled.
And a big problem with this method is that it will low-ball the number in question, since the confidence interval will never close up to a single number, and forecasters only have to worry about the lower end of the confidence interval.
I think we agree on this: iterated closing is an interesting idea, but I'm not sure it solves a problem. It doesn't help with ambiguity, since in the ambiguous cases we can't find firm bounds. And earlier payouts are nice, but by the time we can do partial payouts, they are either tiny, because of large ranges, or they come not much before closing. (They also create nasty problems with incentive compatibility, which I'm unsure can be worked out cleanly.)
Here’s an idea I’ve been ruminating on: create a bunch of nearly identical forecast questions, all worded slightly differently, and grade with maximum inflexibility. Sometimes a pair of nearly identical questions will come to opposite resolutions. In such cases, forecasters who pay close attention to the words may be able to get both questions right, whereas people who treated them the same will get one right and one wrong.
On average, wouldn’t this help things a bit?
It’s an interesting idea, but one that seems to have very high costs for forecasters in keeping the predictions updated and coherent.
If we imagine that we pay forecasters the market value of their time, an active forecasting question with a couple dozen people spending a half hour each updating their forecasts “costs” thousands of dollars per week. Multiplying that across a batch of near-duplicate questions, even when accounting for reduced costs for similar questions, seems not worth it.
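For a rough illustration with assumed figures: 24 forecasters, each spending two half-hour sessions per week at $100/hour, comes to 24 × 2 × 0.5 × $100 = $2,400 per week for a single active question.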
Hm okay. And is this a problem for prediction markets too, even though participants expect to profit from their time spent?
The way I imagine it, sloppier traders will treat a batch of nearly identical questions as identical, arbitraging among them and causing the prices to converge. Meanwhile, the more literal-minded traders will think carefully about how the small changes in the wording might imply large changes in probability, and they will occasionally profit by pushing the batch of prices apart.
But maybe most traders won’t be that patient, and will prefer meta-resolution or offloading.
I still feel like I’m onto something here...
Generally agree that there’s something interesting here, but I’m still skeptical that in most prediction market cases there would be enough money across questions, and enough variance in probabilities, for this to work well.
Sounds like Pascal's problem of the points, where the solution is to provide the expected value of winnings, and not merely allocate all winnings to the player with the highest probability of victory. Suppose one team has a 51% probability of winning: should the traders who bought that side always get a 100% payoff, and the 49% shares be worthless? That sounds extremely distortionary if it happens at all frequently.
Plus quite hard to estimate: if you had a model more accurate than the prediction market, it’s not clear why you would be using the PM in the first place. On the other hand, there is a source of the expected value of each share which incorporates all available information and is indeed close at hand: the share prices themselves. Seems much fairer to simply liquidate the market and assign everyone the last traded value of their share.
Yes, that was exactly what I was thinking of, but 1) I didn’t remember the name, and 2) I wanted a concrete example relevant to prediction markets.
And I agree it's hard to estimate in general, but the problem can still be relevant in many cases, which is why I used my example. In the baseball game, the market closes before the game begins; at that point we don't have a model as good as the market, but once the game is seven-ninths complete, we can do better than the pre-game market prediction.
Why close the markets, though?
For betting markets, the market maker may need to manage the odds differently, and for prediction markets, it's because otherwise you're paying people in lower Brier scores for watching the games, rather than for being good predictors beforehand. (The way that time-weighted Brier scores work is tricky: you could get it right, but in practice it seems that last-minute failures to update are fairly heavily penalized.)
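For reference, a minimal sketch of one simple time-weighted (daily-average-style) scoring rule; real platforms use their own weighting schemes, so the details here are assumptions:

```python
def time_weighted_brier(forecasts, outcome, question_length):
    """Score a forecaster on a binary question: each forecast is carried forward
    until their next update, and its squared error against the 0/1 outcome is
    weighted by the fraction of the question's life it was the standing forecast.

    `forecasts` is a list of (time, probability) pairs sorted by time, with the
    first at time 0."""
    total = 0.0
    for i, (t, p) in enumerate(forecasts):
        t_next = forecasts[i + 1][0] if i + 1 < len(forecasts) else question_length
        total += ((t_next - t) / question_length) * (p - outcome) ** 2
    return total

# A question open for 100 days that resolves "yes":
print(time_weighted_brier([(0, 0.3)], outcome=1, question_length=100))              # 0.49
print(time_weighted_brier([(0, 0.3), (90, 0.95)], outcome=1, question_length=100))  # ~0.44
```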