Some Thoughts on Conditional Forecasts – Lessons from the 2020 Election
Disclaimer: This post was written as part of my job at Open Phil, but it hasn’t been reviewed closely by anyone else at Open Phil, and the opinions and recommendations I put forward are mine only and don’t reflect Open Phil’s views.
In early 2020, Open Philanthropy commissioned forecasts from Hypermind on five socioeconomic indicators. These forecasts were conditional on the result of the 2020 election, i.e. questions were of the form “What will be the value of X on <date> if Trump <wins / doesn’t win>?”. The five indicators (all about the US) were: total COVID deaths, GDP, S&P 500, prime-age employment-population ratio, and rank in the World Happiness Report. The resolution dates were EOY 2022 and 2024. Forecasts were submitted between August and November 2020.
Now that the 2022 data is in, I summarize what I see as the key results in the tables and figures below.[1] I will only evaluate the forecasts using the mainline aggregation algorithm evaluated on the last day the forecasting window was open. This means I won’t talk about (i) individual differences in forecasting skill, (ii) how forecasts changed over time, or (iii) the merits of different aggregation methods.
Results
Accuracy
2⁄5 forecasts were better than chance, as measured by the log score (zero being chance level).
4⁄5 missed the mark by quite a lot, as measured by the CDF evaluated at the true value (which should be 0.5 on average).
Question | Log score[2] | CDF[3] | Interpretation |
COVID Deaths | -inf[4] | 1.000 | Much worse than chance, wildly optimistic |
World Happiness Report Rank | -1.054 | 0.052 | Worse than chance, somewhat pessimistic |
Employment to Population Ratio | -2.380 | 0.995 | Worse than chance, pessimistic |
S&P 500 | 2.881 | 0.418 | Beats chance, no evidence of bias |
GDP | 1.440 | 0.933 | Beats chance, somewhat pessimistic |
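The two columns in the table above can be reproduced from a binned forecast. The sketch below is not the notebook's code: it assumes the aggregated forecast is a list of probabilities over equal-width bins, that "pdf(true value)" in footnote 2 means the probability mass of the bin containing the true value, and that mass is spread evenly within a bin when computing the CDF. The function names are made up for illustration.

```python
import bisect
import math


def bin_index(bin_edges, value):
    """Index of the bin containing `value`, clamped to the valid range."""
    n_bins = len(bin_edges) - 1
    idx = bisect.bisect_right(bin_edges, value) - 1
    return min(max(idx, 0), n_bins - 1)


def log_score(bin_probs, bin_edges, true_value):
    """log2(p * number of bins): zero for a uniform forecast, -inf if p is zero."""
    p = bin_probs[bin_index(bin_edges, true_value)]
    if p <= 0:
        return float("-inf")
    return math.log2(p * len(bin_probs))


def cdf_at(bin_probs, bin_edges, true_value):
    """Forecast CDF at the true value (assumes uniform mass within each bin)."""
    idx = bin_index(bin_edges, true_value)
    lo, hi = bin_edges[idx], bin_edges[idx + 1]
    frac = min(max((true_value - lo) / (hi - lo), 0.0), 1.0)
    return sum(bin_probs[:idx]) + bin_probs[idx] * frac


edges = [0, 1, 2, 3, 4]
print(log_score([0.25, 0.25, 0.25, 0.25], edges, 2.5))  # uniform forecast
print(log_score([0.5, 0.5, 0.0, 0.0], edges, 3.5))      # no mass on the true bin
print(cdf_at([0.5, 0.5, 0.0, 0.0], edges, 3.5))
```

The second toy forecast puts all of its mass below the true value, which gives the same pattern as the COVID row: a log score of -inf and a CDF of 1.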
Differences between conditions
The differences between the two conditions were small in general.
Forecasters expected worse and more uncertain outcomes under Trump, with a few small exceptions like employment and GDP being marginally better in 2022.
Question | Year | Entropy ratio[7] | Mean difference / range[8] |
COVID Deaths | 2022 | 1.015 | 0.088 |
COVID Deaths | 2024 | 1.027 | 0.144 |
World Happiness Report Rank | 2022 | 1.047 | 0.016 |
World Happiness Report Rank | 2024 | 1.145 | 0.001 |
Employment to Population Ratio | 2022 | 1.084 | -0.008 |
Employment to Population Ratio | 2024 | 1.009 | 0.004 |
S&P 500 | 2022 | 1.130 | 0.075 |
S&P 500 | 2024 | 1.001 | 0.054 |
GDP | 2022 | 0.978 | -0.003 |
GDP | 2024 | 1.110 | 0.002 |
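For readers who want to compute the two comparison metrics themselves, here is a rough sketch. It follows the formulas in footnotes 7 and 8, but the details are my assumptions rather than the notebook's: entropy is taken to be Shannon entropy (in bits) over the forecast bins, and each conditional mean is computed from bin midpoints. All names are hypothetical.

```python
import math


def entropy(bin_probs):
    """Shannon entropy in bits; empty bins contribute nothing."""
    return -sum(p * math.log2(p) for p in bin_probs if p > 0)


def binned_mean(bin_probs, bin_edges):
    """Mean of the forecast, placing each bin's mass at its midpoint."""
    mids = [(a + b) / 2 for a, b in zip(bin_edges[:-1], bin_edges[1:])]
    return sum(p * m for p, m in zip(bin_probs, mids))


def compare_conditions(trump_probs, other_probs, bin_edges):
    """Entropy ratio and range-normalized mean difference between conditions."""
    question_range = bin_edges[-1] - bin_edges[0]
    mean_gap = binned_mean(trump_probs, bin_edges) - binned_mean(other_probs, bin_edges)
    return {
        # Undefined if the other forecast puts all its mass in one bin.
        "entropy_ratio": entropy(trump_probs) / entropy(other_probs),
        "mean_diff_over_range": mean_gap / question_range,
    }


# Toy data, not the tournament's forecasts.
edges = [0, 10, 20, 30, 40]
result = compare_conditions([0.25, 0.25, 0.25, 0.25], [0.5, 0.5, 0.0, 0.0], edges)
print(result)
```

An entropy ratio above 1 means the forecast under Trump is more spread out than the forecast under the other condition. The sign of the mean difference has to be read against each indicator, since a higher value is bad for deaths and rank but good for employment, the S&P 500, and GDP.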
Narrative takeaways
Forecast accuracy varied across question categories. From best to worst:
The economic forecasts ranged from good (GDP, S&P 500) to poor (employment-to-population ratio was underestimated by 4 percentage points). My guess is that the first two may have been easier because forecasters could fall back on simple extrapolation or defer to the market by e.g. taking into account the price of certain financial derivatives.[9] Labor force participation forecasts can’t benefit from stubborn secular trends or deep, liquid prediction markets. This hypothesis is just post hoc speculation, so take it with a grain of salt.
The forecast regarding the US rank in the World Happiness Report seems unimpressive to me (17th-20th expected vs. 15th actual). This is roughly in line with Metaculus’s forecast, but it gets a negative log score whereas Metaculus got a positive one. I wonder if asking directly about average happiness would have resulted in a more accurate forecast. Ranks can be brittle and messy because they depend on the ranks of every other country in the report (although most are so far apart that they have limited practical impact).
The Covid death forecast was a huge underestimate (100k-300k vs. 750k). Still, it was generally consistent with other forecasts at the time and in early 2021 (e.g., this one and this one from Metaculus) that largely missed the fall and winter waves in 2020 and 2021 driven by new variants.
How should we interpret the fact that three key economic indicators were expected to be roughly the same under two (potentially very different) administrations?
On the one hand, the result raises questions about the value of conditional forecasting: if such a substantial difference in the world as Trump being in the White House vs. not doesn’t lead to divergent forecaster expectations very often, what might?
On the other hand, maybe we shouldn’t have expected the choice of president to affect most of these specific outcomes, e.g. my understanding is that economic growth doesn’t care too much which party is in power. Also, there’s a track record of conditional questions resulting in large, meaningful differences, e.g. on Metaculus.[10] This makes me believe that the questions in this tournament weren’t very well suited to revealing the usefulness of conditional forecasting.
A possible counterargument: if one saw a high chance of autocratic backslide or violent insurrection in one condition but not the other, these questions (as well as many others) would’ve been likely to show a significant difference. One could argue that it wasn’t obvious at the time that those catastrophic outcomes wouldn’t happen, but I think there’s weak evidence against that: all throughout 2020, Metaculus estimated a ~0.5% probability of a second civil war in the US before July 2021, peaking at ~3% around J6. Note however that this is a relatively high bar compared to “autocracy” or “insurrection”.
“Expected Covid deaths” and “Rank in the World Happiness Report” did result in a meaningful difference between the two conditions, but this is undermined by the fact that these were also the least accurate predictions in 2022.[11] My rough intuition is that, if a forecasting method produces a difference of X units between conditionals, but we know based on that method’s track record that the expected absolute error is >X, we should be less inclined to think the difference in forecasts is coming from signal rather than noise. Put another way, the lack of precision explains away the difference in forecasts.
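The intuition above can be made concrete with a one-line ratio: the absolute forecast error divided by the gap between the two conditional means. The sketch below uses the World Happiness Report figures from footnote 11 (means of 19.9 under Trump and 18.4 under not-Trump, actual rank 15). Measuring the error against the not-Trump mean is my assumption, on the grounds that it is the condition that materialized; the function name is hypothetical.

```python
def error_to_gap_ratio(mean_realized, mean_counterfactual, actual):
    """Absolute forecast error as a multiple of the gap between conditions."""
    gap = abs(mean_realized - mean_counterfactual)
    error = abs(mean_realized - actual)
    return error / gap


# World Happiness Report rank, 2022 resolution.
ratio = error_to_gap_ratio(mean_realized=18.4, mean_counterfactual=19.9, actual=15)
print(round(ratio, 1))  # about 2.3
```

A ratio above 1 means the forecast missed by more than the two conditions differed from each other, which is the situation where the difference is hard to distinguish from noise.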
[1] See accompanying colab notebook.
[2] This is calculated as log2(pdf(true value) * number of bins). The normalization is such that a uniform distribution over all bins would get a score of zero.
[3] This is the cumulative distribution function at the true value. Lower values indicate forecasters overshot. Higher values indicate forecasters undershot.
[4] The pdf was zero at the true value.
[5] The relevant question range is the difference between the maximum and minimum values allowed in forecasts.
[6] A similar picture emerges if we normalize by the (pooled) standard deviation instead of the range.
[7] This is `trump_entropy / other_entropy`.
[8] This is `(trump_average - other_average) / question_range`.
[9] I tried to find the mid-2020 price of S&P 500 futures expiring in December 2022, but couldn’t. This chart displays the relevant historical prices, but the price for that particular contract (ESZ22) is only available from mid-2021 onwards. As of May 2023, the forward-looking futures chain on Google Finance has contracts up until December 2027, but those expiring after June 2024 have no liquidity. My hypothesis is that, even though the contracts are technically available over longer time horizons, there’s no real trading up until ~18 months before expiration. I find this odd, so there’s a good chance I’m missing something.
[10] E.g. at the time of writing, P(US restricts compute capacity before 2050 | HLMI by 2040) = 37%, but this goes down to 22% if there’s no HLMI by 2040.
[11] In fact, the absolute error in both questions was larger than the difference between the means of the two conditional distributions. E.g. in the question about the World Happiness Report, the mean was 19.9 under Trump and 18.4 under not-Trump. The actual number was 15, so the forecast missed the mark by ~2x the difference between conditions. The same calculation for the Covid question yields a factor of ~13x.
Interesting read, thanks for writing it up. FYI the link “The report on the 2022 results is now available” leads to a private Google Drive file.
Thanks for flagging. Fixed now.