I was curious to see how various prognosticators—specifically, FiveThirtyEight and The Economist’s models, and the PredictIt prediction markets—did on predicting the state-by-state (plus the District of Columbia) results of the recent U.S. presidential election.
Mathematical Sidebar
There are various ways to evaluate probabilistic predictions, but my favorite is to use the logarithmic score: the logarithm of the probability assigned to the right answer. (Logs of probabilities are negative numbers, but negating everything to get positive numbers doesn’t change anything interesting—you just minimize instead of maximizing—such that I tend to mentally conflate the log and the negative-log.)
The reason I think the logarithmic score is best is because it has one essential property, one cool property, and a meaningful interpretation.
The essential property is that it incentivizes you to report your actual probabilities: if something actually happens 80% of the time, you get the best expected score by giving it probability 0.8.
The cool property is that it doesn’t matter how you chop up your observations: because conjunction of probabilities is multiplication and the logarithm maps multiplication to addition, we get the same score whether we consider “A&B” as one event, or separately add up the scores of “A” and “B, given A”.
(Although as David Schneider-Joseph and Oscar Cunningham pointed out in response to the originally published version of this post, my calculations later in this post add up scores of state-by-state predictions as if the state results were independent events, but this independence assumption is not realistic. Sorry.)
The interpretation is that the logarithmic score represents the length of the message you would need to encode the actual outcome using a code optimized for your model. (Er, the negation of the score represents the length—there’s that conflation.)
Methodology
So, the election results aren’t really final until all the states certify their results—or perhaps, when the Electoral College meets. The Trump campaign has filed lawsuits disputing results in a number of states, and at time of writing, ABC News’s map has no call for Alaska, Arizona, North Carolina, and Georgia. But for the purposes of this blog post, I’m just going to go with the current colors on Politico’s map, and assume that Biden wins Arizona and Georgia and that Trump wins Alaska and North Carolina.
At around 20:10 Pacific time on 2 November, Election Day Eve, I downloaded the model-output ZIP files from FiveThirtyEight and The Economist. Then today, I looked up 2 November prices for the Democrat-wins contracts in the line-graphs on the pages for the PredictIt “Which party will win this-and-such-state in the 2020 presidential election?” markets. (Both FiveThirtyEight and The Economist would later issue final updates on Election Day, but I’m assuming it couldn’t have changed much, and it’s a fairer comparison to the PredictIt bucketed-by-day line graphs to use the model outputs from Election Day Eve.)
I got The Economist’s per-state Biden-victory probabilities from the projected_win_prob
column in /output/site_data//state_averages_and_predictions_topline.csv
in The Economist’s zipfile, and FiveThirtyEight’s per-state Biden-victory probabilities from the winstate_chal
column in /election-forecasts-2020/presidential_state_toplines_2020.csv
in FiveThirtyEight’s zipfile.
I’m only using the states’ at-large results and ignoring the thing where Nebraska and Maine do some of their electoral votes by Congressional district.
I’m using the convention of giving probabilities in terms of the event of interest being “Biden/Democrat wins” rather than “Trump/Republican wins”, because that’s what The Economist seems to be giving me, but I appreciate that the FiveThirtyEight spreadsheet has separate winstate_inc
(incumbent), winstate_chal
(challenger), and winstate_3rd
(thirty-party) columns. (Although the winstate_3rd
column is blank!!) Using “incumbent wins” as the event actually seems like a more natural Schelling-point convention to me?—but it doesn’t matter.
I wrote a Python script to calculate the scores. The main function looks like this:
def score():
scores = {"fivethirtyeight": 0, "economist": 0, "predictit": 0}
swinglike_only_scores = {
"fivethirtyeight": 0,
"economist": 0,
"predictit": 0,
}
losses = {}
swinglikes = [sr for sr in state_results if is_swinglike(sr)]
print(
"{} states ({}) are swinglike".format(
len(swinglikes),
','.join(sr.state for sr in swinglikes)
)
)
for state_result in state_results:
if state_result.actual is None:
continue
for predictor in scores.keys():
if state_result.actual: # Biden win
probability = getattr(state_result, predictor)
else:
probability = 1 - getattr(state_result, predictor)
subscore = log2(probability)
scores[predictor] += subscore
if is_swinglike(state_result):
swinglike_only_scores[predictor] += subscore
losses[(predictor, state_result.state)] = subscore
return scores, swinglike_only_scores, losses
(Full source code, including inline data.)
Getting the numbers from both spreadsheets and the browser line graphs into my script involved a lot of copy-pasting/Emacs-keyboard-macros/typing, and I didn’t exhaustively double-check everything, so it’s possible I made some mistakes, but I hope that I didn’t, because that would be embarrassing.
Results
Scoring over all the states, The Economist did the best, with a score of about −7.51 bits, followed by FiveThirtyEight at −9.24 bits, followed by PredictIt at −12.14.
But I was worried that this methodology might privilege the statistical models over the markets, because the models (particularly The Economist, which was the most confident across the board) might be eking out more points by assigning very low/high probabilities to “safe” states in ways that the PredictIt markets won’t: intuitively, I suspect that the fact that someone is willing to scoop up Biden-takes-Alabama contracts at 2¢ each may not be capturing the full power of what the market can do on harder questions. So I also calculated the scores on just the 12 “swing-like” states (Alaska, Arizona, Florida, Georgia, Iowa, Montana, North Carolina, New Hampshire, Nevada, Ohio, Pennsylvania, and Texas) where at least two of our three predictors gave a probability between 0.1 and 0.9.
On just the “swing-like” states, the results are a lot more even, with The Economist at −7.36 bits, PredictIt at −7.40, and FiveThirtyEight at −7.89.
The five largest predictive “misses” were The Economist and FiveThirtyEight on Florida (the models leaned Biden but the voters said Trump, giving the models log scores of −2.22 and −1.66, respectively), The Economist and FiveThirtyEight on North Carolina (same story for scores of −1.60 and −1.46), and PredictIt on Georgia (the market leaned Trump, but we think the voters are saying Biden for a score of −1.32).
Does it make sense to calculate the score like this for events that aren’t independent? You no longer have the cool property that it doesn’t matter how you chop up your observations.
I think the correct thing to do would be to score the single probability that each model gave to this exact outcome. Equivalently you could add the scores for each state, but for each use the probability conditional on the states you’ve already scored. For 538 these probabilities are available via their interactive forecast.
Otherwise you’re counting the correlated part of the outcomes multiple times. So it’s not surprising that The Economist does best overall, because they had the highest probability for a Biden win and that did in fact occur.
My suggested method has the nice property that if you score two perfectly correlated events then the second one always gives exactly 0 points.
I just did the calculations. Using the interactive forecast from 538 gives them a score of −9.027; using the electoral_college_simulations.csv data from The Economist gives them a score of −7.841. So The Economist still wins!
Thanks; you are right and the thing I originally wrote was wrong. For posterity, I added a disclaimer (sixth paragraph) that credits you and the other person who pointed this out.
While I realize that the PredictIt predictions don’t actually have this data available, fivethirtyeight projects vote share by state. How much work do you think it would be to do an analysis that uses vote share instead of just wins by state?
Most closely contested states went to Biden, so vote share is more in Trump’s favor than you’d expect based on knowing only who won each state, and PredictIt generally predicted more votes for Trump, so I think PredictIt comes out a lot better than 538 and the Economist.
PredictIt does have this available for some states, e.g. Pennsylvania.
Here is a similar post.