Information Charts
Edit 2020/12/13: I think a lot of people have misunderstood my intent with this post. I talk about the 2020 election because it seemed like the perfect example to illustrate the point, but it was still only an example. The goal of this post is to establish a philosophically sound ground truth for probabilities.
1. Motivation
It seems to me that:
Predictions can be better or worse.
This is true even with a sample size of 1.
The quality of a prediction is not purely a function of its probability. (A 90% prediction for a false outcome may still be good, and two 70% predictions for the same outcome need not be equally good.)
A lot of people have a hard time wrapping their heads around how 1-3 can simultaneously be true.
In this post, I work out a framework in which there is a ground truth for probabilities, which justifies 1-3.
2. The Framework
Consider the following game:
Throw a fair coin 100 times, counting the number of heads. You win the game if this number is at most 52 and lose if it is 53 or higher.
Using the binomial formula, we can compute that the probability of winning this game is around 0.69135. (In mathy notation, that’s $\sum_{k=0}^{52} \binom{100}{k} p^k (1-p)^{100-k} \approx 0.69135$ with $p = \frac{1}{2}$.)
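As a sanity check on this number, here is a minimal Python sketch (mine, not from the post) using only the standard library:

```python
from math import comb

# P(at most 52 heads in 100 fair coin flips)
p_win = sum(comb(100, k) for k in range(53)) / 2 ** 100
print(round(p_win, 5))  # 0.69135, the prior used throughout the post
```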
Suppose we play this game, writing down each coin flip as it occurs, and put the progression into a chart:
Here, the red line denotes the number of heads we have after $n$ flips, and the blue line denotes the ‘baseline’, which is the number of heads such that continuing at this pace would end up at precisely 52 heads after 100 flips. We end the game above the baseline, with 54 heads, which means that we lose.
Since we know exactly how this game works, it’s possible to compute the current probability of winning at every point during the game.[1] Here is a chart of these probabilities, given the flips from the chart above:
Note that the $y$-axis shows the probability of winning the game after observing the $n$-th flip. Thus, there are precisely 101 values here, going from the probability after observing the $0$-th flip (i.e., without having seen any flips), which is $0.69135$, to the probability after observing the $100$-th flip, which is 0. By comparing both graphs visually, you can verify that they fit together.
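To make the construction concrete, here is a hedged Python sketch (the function names are mine) that computes all 101 conditional win probabilities for a given sequence of flips:

```python
import random
from math import comb

def win_probability(remaining: int, heads_allowed: int) -> float:
    """P(at most `heads_allowed` heads in `remaining` fair coin flips)."""
    if heads_allowed < 0:
        return 0.0
    return sum(comb(remaining, k) for k in range(heads_allowed + 1)) / 2 ** remaining

def information_chart(flips: list[int]) -> list[float]:
    """Conditional win probability after each prefix of `flips` (1 = heads, 0 = tails)."""
    return [
        win_probability(len(flips) - n, 52 - sum(flips[:n]))
        for n in range(len(flips) + 1)
    ]

random.seed(0)
chart = information_chart([random.randint(0, 1) for _ in range(100)])
print(chart[0])   # the prior, ~0.69135
print(chart[-1])  # 1.0 if this particular run was won, 0.0 if it was lost
```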
Each of these 101 $y$-values is a prediction for the outcome of the game. However, they’re evidently not equally well-informed predictions: the ones further to the right are based on more information (more flips) than those further to the left, and this is true even for predictions that output the same probability. For example, we predict almost exactly 80% after both 13 and 51 flips, but the latter prediction has a lot more information to go on.
I call a graph like this an information chart. It tells us how the probability of a prediction changes as a function of {amount of input information}.
A separate aspect that also influences the quality of a prediction is calibration. In this case, all of the 101 predictions made by the blue curve are perfectly calibrated: if we ran the game a million times, took all 101 million predictions made by the one million blue curves, and put them all into bins, each bin would (very likely) have a proportion of true predictions that closely resembles its probability.[2]
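For readers who want to check this claim empirically, here is a hedged simulation sketch (all names are mine, and the bin width of 0.1 is an arbitrary choice): it replays the game many times, bins every intermediate prediction, and prints the observed win rate per bin, which should fall inside each bin’s range.

```python
import random
from collections import defaultdict
from math import comb

# P(win | h heads after the first n flips): at most 52 - h heads in the remaining 100 - n flips.
WIN_PROB = {
    (n, h): sum(comb(100 - n, k) for k in range(max(52 - h, -1) + 1)) / 2 ** (100 - n)
    for n in range(101)
    for h in range(n + 1)
}

def run_game() -> tuple[list[float], bool]:
    """Play one game; return its 101 intermediate predictions and whether it was won."""
    flips = [random.randint(0, 1) for _ in range(100)]
    preds, heads = [WIN_PROB[(0, 0)]], 0
    for n, flip in enumerate(flips, start=1):
        heads += flip
        preds.append(WIN_PROB[(n, heads)])
    return preds, sum(flips) <= 52

random.seed(1)
bins = defaultdict(lambda: [0, 0])  # bin index -> [number of predictions, number of wins]
for _ in range(20_000):
    preds, won = run_game()
    for p in preds:
        b = min(int(p * 10), 9)  # ten bins: [0.0, 0.1), ..., [0.9, 1.0]
        bins[b][0] += 1
        bins[b][1] += won
for b in sorted(bins):
    total, wins = bins[b]
    print(f"[{b / 10:.1f}, {(b + 1) / 10:.1f}): observed win rate {wins / total:.3f}")
```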
However, while the blue curve has perfect calibration, we can also model imperfect calibration in this framework. To do this, I’ve computed the probability of winning, provided that we underestimate (red) or overestimate (green) the number of flips that are still missing, relative to those we know.[3] The results look like this:
Looking at these three information charts together, we can see the following:
Whenever things look good, the red curve underestimates the remaining uncertainty and thus becomes too optimistic: it overshoots the blue curve.
Whenever things look bad, the red curve still underestimates the remaining uncertainty, but now that means it becomes too pessimistic: it undershoots the blue curve.
The green curve does the precise opposite: it always overestimates the remaining uncertainty and is thus less optimistic/less pessimistic than it should be.
Notably, ‘good’ and ‘bad’ have nothing to do with 50%. Instead, they are determined by the prior, the probability you would assign without any information. In this case, the prior is 0.69135, so the red curve overshoots the blue curve whenever the blue curve is above 0.69135 and undershoots the blue curve whenever the blue curve is below 0.69135. (And the green curve does the opposite in both cases.)
Unlike the blue curve, the red curve’s predictions are not perfectly calibrated. Their calibration is pretty good around the 70% bin (because the prior 0.69135 happens to be close to 70%), but poor everywhere else. Predictions in the 90% bin would come true less than 9 out of 10 times, and predictions in the 50% bin would come true more than 5 out of 10 times. (And the opposite is true for the green curve.)
In summary, the following concepts fall out of this framework:
Predictions can be based on more or less information.
In expectation, more information moves the prediction in the right direction. However, any single piece of information may move it in the wrong direction. We can visualize this relationship using an information chart.
Predictions can have better or worse calibration.
Calibration is about correctly estimating the remaining amount of uncertainty.[4]
(The attentive reader may also notice that “50% predictions can be meaningful” follows as an easy corollary from the above.)
My primary claim in this post is that information charts are the proper way to look at predictions – not just in cases where we know all of the factors, but always. It may be difficult or impossible to know what the information chart looks like in practice, but it exists and should be considered the philosophical ground truth. (Information isn’t a one-dimensional spectrum, but I don’t think that matters all that much; more on that later.[5]) Doing so seems to resolve most philosophical problems.
I don’t think it is possible in principle to weigh calibration and information against each other. Good calibration removes bias; more information moves predictions further away from the prior. Depending on the use case, you might prefer a bunch of bold predictions that are slightly miscalibrated over a bunch of cautious predictions with perfect calibration. However, it does seem safe to say that:
quality is a function of {amount of information} and {calibration}; and
it increases monotonically with each coordinate. (I.e., all else equal, more information means a higher quality prediction, and better calibration means a higher-quality prediction.)
3. A Use Case: The 2020 Election
Here is an example of where I think the framework helps resolve philosophical questions.
On the eve of the 2020 election, 538’s model (written by Nate Silver) predicted an 89% probability of Biden winning a majority in the Electoral College (with 10% for Trump and 1% for a tie). At the same time, a weighted average of prediction markets had Biden at around 63% for becoming the next president.[6] At this point, we know that
Biden has won; however
there was a set of four states[7] such that, had the voting gap changed by ~0.7% in favor of Trump in all four, Trump would have won the Electoral College instead.
The first convenient assumption we will make here is that both predictions had perfect calibration. (This is arguably almost true.) Given that, the only remaining question is which prediction was made with more information.
To get my point across, it will be convenient to assume that we know what the information chart for this prediction looks like:
If you accept this, then there are two stories we can tell about 538 vs. betting markets.[8]
Story 1: The Loyal Trump Voter
In the first story, no-one had foreseen the real reasons why polls were off in favor of Trump; they may as well have been off in favor of Biden. Consequently, no-one had good reasons to assign Biden a <89% chance of winning, and the people who did so anyway would have rejected their reasons if they had better information/were more rational.
If the above is true, then everyone who bet on Biden, as some on LessWrong advised, got a good deal. However, there is also a different story.
Story 2: The Efficient Market
In the second story, the markets knew something that modelers didn’t: a 2016-style polling error in the same direction was to be expected. They didn’t know how large it would be exactly, but they knew it was large enough for 63% to be a better guess than 89% (and perhaps the implied odds by smart gamblers were even lower, and the price only came out 63% because people who trusted 538 bought overpriced Biden shares). The outcome we did get (a ~0.7% win for Biden) was among the most likely outcomes as projected by the market.
Alas, betting on Biden was a transaction that, from the perspective of someone who knew what the market knew, had negative expected returns.
In reality, there was probably at least a bit of both worlds going on, and the information chart may not be accurate in the first place. Given that, either of the two scenarios above may or may not describe the primary mechanism by which pro-Trump money entered the markets. However, even if you reject them both, the only specific claim I’m making here is that the election could have been such that the probability changed non-monotonically with the amount of input information, i.e.:
knowing more may have shifted the odds in one direction;
knowing more still may have shifted them in the other;
if so, both would have represented an improvement in the quality of the prediction.
If true, this makes the information chart non-injective, meaning that the same probability can be implied by several positions on the chart – such as the 63% for Biden. Because of this, we cannot infer the quality of the prediction solely based on the stated probability.
And yes: real information is not one-dimensional. However, the principles still work in a 1000000-dimensional space:
The amount of information is still inversely related to the distance between {current point} and {point of perfect information}. (Distances can be measured in arbitrarily high-dimensional spaces.)
Calibration is still about estimating this distance.
If false information is included, the only thing that changes is that now there are points on the chart whose [distance to the point of full information] is larger than the distance between no information and full information.
Thus, to sum up this post, these are the claims I strongly believe to be true:
One cannot go directly from “Biden won” to “therefore, 538’s prediction was better since it had more probability mass on the correct outcome”.
One cannot go directly from “the election was close” to “therefore, the market’s prediction was better since it implied more narrow margins”.
And, perhaps most controversially:
There is nothing in principle missing from the information chart picture. In particular, there is no meaningful way to say something like “even though it is true that gamblers who bet on Trump would not have done so if they were smarter/better informed, their prediction was still better than that of 538”. If the premise is true (which it may not be), then the market’s/538’s predictions are analogous to two points on the blue curve in our coin-flip game, with 538 being further to the right.[9]
4. Appendix: Correct Probabilities and Scoring Functions
I’ve basically said what I wanted to say at this point – this fourth section is there to overexplain/make further arguments.
One thing that falls out of this post’s framework is that it makes sense to say that one prediction (and by extension, one probability) is better than another, but it doesn’t make sense to talk about the correct probability – unless ‘correct’ is defined as the point of full information, in which case it is usually unattainable.
This also means that there are different ways to assign probabilities to a set of statements that are all perfectly calibrated. For example, consider the following eight charts that come from eight runs of the 100-coins game:
There are many ways of obtaining a set of perfectly calibrated predictions from these graphs. The easiest is to throw away all information and go with the prior every time (which is the starting point on every graph). This yields eight predictions that all claim a 0.69135 chance of winning.[10] Alternatively, we can cut off each chart after the halfway point:
This gives us a set of eight predictions that have different probabilities from the first set, despite predicting the same thing – and yet, they are also perfectly calibrated. Again, unless we consider the point of full information, there is no ‘correct’ probability, and the same chart may feature a wide range of perfectly calibrated predictions for the same outcome.
You probably know that there are scoring functions for predictions. Our framework suggests that the quality of predictions is a function of {amount of information} and {calibration}, which raises the question of which of the two a scoring function measures. (We would hope that they measure both.) What we can show is that, for logarithmic scoring,
all else equal, better calibration leads to a better score; and
assuming perfect calibration, more information leads to a better score.
The second of these properties implies that, the later we cut off our blue curves, the better a score the resulting set of predictions will obtain – in expectation.
Let’s demonstrate both of these. First, the rule. Given a set of predictions $p_1, \ldots, p_n$ (with $p_i \in [0, 1]$) and a set of outcomes $o_1, \ldots, o_n$ (with $o_i \in \{0, 1\}$), logarithmic scoring assigns to this set the number
$$\sum_{i=1}^{n} \Big( o_i \cdot \log(p_i) + (1 - o_i) \cdot \log(1 - p_i) \Big).$$
Since the $o_i$ are either 1 or 0, the formula amounts to summing up [the logarithms of the probability mass assigned to the true outcome] across all our predictions. E.g., if I make five 80% predictions and four of them come true, I sum up $4 \cdot \log(0.8) + 1 \cdot \log(0.2) \approx -2.5$.
Note that these terms are all negative, so the highest possible score is the one with the smallest absolute value. Note also that $\log(x)$ converges to $-\infty$ as $x$ goes to 0: this corresponds to the fact that, if you increase your confidence in a prediction but are wrong, your punishment grows indefinitely. Predicting 0% for something that comes true yields a score of $-\infty$.
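A minimal implementation of this rule (a sketch; the function name log_score is mine):

```python
from math import log

def log_score(predictions: list[float], outcomes: list[int]) -> float:
    """Sum of the log-probabilities assigned to the outcomes that actually occurred."""
    total = 0.0
    for p, o in zip(predictions, outcomes):
        prob_of_truth = p if o == 1 else 1 - p
        total += log(prob_of_truth) if prob_of_truth > 0 else float("-inf")
    return total

# The worked example from the text: five 80% predictions, four of which come true.
print(log_score([0.8] * 5, [1, 1, 1, 1, 0]))  # 4*log(0.8) + log(0.2) ≈ -2.50
```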
I think the argument about calibration is the less interesting part (I don’t imagine anyone is surprised that logarithmic scoring rewards good calibration), so I’ve relegated it into a footnote.[11]
Let’s look at information. Under the assumption of perfect calibration, we know that any prediction we have assigned probability $p$ will, indeed, come true with probability $p$. Thus, the expected score for such a prediction[12] is
$$f(p) = p \cdot \log(p) + (1 - p) \cdot \log(1 - p).$$
We can plot $f(p)$ for $p \in (0, 1)$. It looks like this:
This shows us that, for any one prediction, a more confident verdict is preferable, provided calibration is perfect. That alone does not answer our question. If we increase our amount of information – if we take points further to the right on our blue curves – some predictions will have increased probability, others will have decreased probability. You can verify this property by looking at some of the eight charts I’ve pictured above. What we can say is that
in expectation, the sum of all probabilities remains constant (otherwise, either set of predictions would not be perfectly calibrated); and
in expectation, our values move further apart (I don’t have an easy argument to demonstrate this, but it seems intuitively obvious).
Thus, the question is whether moving away from $p$ in both directions, such that the total probability mass remains constant, yields a higher expected score. In other words, we want that
$$f(p + \epsilon) + f(p - \epsilon) \geq 2 \cdot f(p).$$
Fortunately, this inequality is immediate from the fact that $f$ is strictly convex (which can be seen from the graph pictured above).[13] Similar things are true for the Brier score.[14]
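To tie this back to the earlier claim that later cutoffs yield better scores in expectation, here is a hedged simulation sketch (all names are mine; the game parameters are those of section 2):

```python
import random
from math import comb, log

# P(win | h heads after n flips) in the coin-flip game from section 2.
WIN_PROB = {
    (n, h): sum(comb(100 - n, k) for k in range(max(52 - h, -1) + 1)) / 2 ** (100 - n)
    for n in range(101)
    for h in range(n + 1)
}

def average_log_score(cutoff: int, games: int = 20_000) -> float:
    """Average log score of the single prediction made after `cutoff` flips."""
    total = 0.0
    for _ in range(games):
        flips = [random.randint(0, 1) for _ in range(100)]
        p = WIN_PROB[(cutoff, sum(flips[:cutoff]))]
        won = sum(flips) <= 52
        prob_of_truth = p if won else 1 - p
        total += log(prob_of_truth) if prob_of_truth > 0 else float("-inf")
    return total / games

random.seed(2)
for cutoff in (0, 25, 50, 75, 100):
    print(cutoff, round(average_log_score(cutoff), 3))
# The average score should increase (approach 0) as the cutoff moves to the right.
```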
[1]
I.e., the probability of ending up with at most 52 heads total in 100 flips, conditional on the flips we’ve already seen. For example, the first two flips have come up tails in our case, so after flip #2, we’re hoping for at most 52 heads in the next 98 flips. The probability for this is $\sum_{k=0}^{52} \binom{98}{k} p^k (1-p)^{98-k}$ with $p = \frac{1}{2}$, which is about 76%. Similarly, we’ve had 12 heads and 10 tails after 22 flips, so after flip #22, we’re hoping for at most 40 heads in the next 78 flips. The probability for this is $\sum_{k=0}^{40} \binom{78}{k} p^k (1-p)^{78-k}$ with $p = \frac{1}{2}$, which is about 63%.
[2]
To spell this out a bit more: we would run the game a million times and create a chart like the one I’ve shown for each game. Since each chart features points at 101 $x$-positions, we can consider these 101 million predictions about whether a game was won. We also know how to score these predictions since we know which games were won and which were lost. (For example, in our game, all predictions come out false since we lost the game.)
Then, we can take all predictions that assign a probability between 0.48 and 0.52 and put them into the ‘50%’ bin. Ideally, around half of the predictions in this bin should come true – and this is, in fact, what will happen. As you make the bins smaller and increase the number of runs (go from a million to a billion etc.), the chance that any bin is off by at least $\epsilon$ converges to 0, for every $\epsilon > 0$.
All of the above is merely a formal way of saying that these predictions are perfectly calibrated.
[3]
To be precise, what I’ve done is to take a monotonic function $f$ with $f(0) = 0$, $f(100) = 100$, and $f(n) > n$ in between, which looks like this:
and use that to model how deluded each prediction is about the amount of information it has access to. I.e., after the 50th flip, the red curve assumes it has really seen $f(50) = 75$ flips, and that only 25 are missing. The value of those 75 flips is extrapolated from the 50, so if exactly half of the real 50 have come up heads, it assumes that 37.5 of the 75 have come up heads. In this case, this would increase its confidence that the final count of heads will be 52 or lower.
In general, after having seen $n$ flips, the red curve assumes it knows $f(n)$ flips. Since $f(0) = 0$ and $f(100) = 100$, it starts and ends at the same point as the blue curve. Similarly, the green curve assumes it knows only $f^{-1}(n)$ many flips.
This may not be the best way to model overconfidence since it leads to 100% and 0% predictions. Then again, real people do that, too.
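Since the picture of the distortion function is not reproduced here, the following sketch uses a hypothetical choice, $f(n) = 100 - (100 - n)^2 / 100$, which satisfies $f(0) = 0$, $f(50) = 75$, and $f(100) = 100$ as described above; the rounding to whole flips is also my own simplification:

```python
from math import comb

def win_prob(remaining: int, allowed: int) -> float:
    """P(at most `allowed` heads in `remaining` fair coin flips)."""
    if allowed < 0:
        return 0.0
    return sum(comb(remaining, k) for k in range(allowed + 1)) / 2 ** remaining

def overconfident(n: int) -> float:
    """Hypothetical distortion with f(0) = 0, f(50) = 75, f(100) = 100."""
    return 100 - (100 - n) ** 2 / 100

def distorted_prediction(n: int, heads: int, f) -> float:
    """Prediction of a forecaster who, after n flips with `heads` heads,
    believes it has actually seen f(n) flips with the same ratio of heads."""
    assumed_flips = round(f(n))
    assumed_heads = round(heads * f(n) / n) if n > 0 else 0
    return win_prob(100 - assumed_flips, 52 - assumed_heads)

print(win_prob(50, 52 - 25))                        # honest prediction after 50 flips, 25 heads
print(distorted_prediction(50, 25, overconfident))  # the red curve's more extreme prediction
```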
[4]
It’s worth pointing out that this makes calibration a property of single predictions, not an emergent property of sets of predictions. This seems like a good thing to me; if someone predicts a highly complex sequence of events with 50% probability, I generally don’t feel that I require further predictions to judge its calibration.
[5]
Furthermore, ‘information’ isn’t restricted to the literal input data but is meant to be a catch-all for everything that helps to predict something, including better statistical models or better intuitions.
[6]
The outcomes for these predictions may come apart if someone who didn’t win the election becomes president. (Although BetFair supposedly has ‘projected winner’ as its criterion.)
[7]
That’s Pennsylvania (~0.7% margin), Wisconsin (~0.7% margin), Georgia (~0.2% margin), and Arizona (~0.3% margin).
[8]
Note that there are a lot more (and better) markets than PredictIt, I’m just using it in the image because it has a nice logo.
[9]
To expand on this more: arguing that the market’s prediction was better solely based on the implied margin seems to me to be logically incoherent:
Biden won, so the naive view (which I strongly reject) says that higher probabilities were better.
Deviating from the naive view implicitly assumes that confidently predicting a narrow win was too hard to be plausible, which is an argument about the information chart, unless the claim is that it’s impossible due to quantum randomness. In particular, making this argument implicitly draws the ‘implausible’ region of the chart (i.e., the part that you don’t believe anyone can enter) around just the final slope, so that betting 89%+ for good reasons was infeasible, but betting ~63% for good reasons was feasible. This may well be true, but making this argument acknowledges the existence of an information chart.
[10]
Incidentally, this set of eight predictions appears poorly calibrated: given their stated probability of 0.69135, we would expect about five and a half of them to come true (so 5 or 6 would be good results), yet only 4 did. However, this is an artifact of our sample being small. Perfect calibration does not imply perfect-looking calibration at any fixed sample size; it only implies that the probability of apparent calibration being off by some fixed amount converges to zero as the sample size grows.
[11]
Consider a set of predictions from the same bin, i.e., to which we have assigned the same probability $p$. Suppose their real frequency is $q$. We would now hope that the probability which maximizes our score is $p = q$. For each prediction in this bin, since it will come true with probability $q$, we will have probability $q$ to receive the score $\log(p)$ and probability $1 - q$ to receive the score $\log(1 - p)$. In other words, our expected score is
$$g(p) = q \cdot \log(p) + (1 - q) \cdot \log(1 - p).$$
To find out what value of $p$ maximizes this function, we take the derivative:
$$g'(p) = \frac{q}{p} - \frac{1 - q}{1 - p}.$$
This term is 0 iff $\frac{q}{p} = \frac{1 - q}{1 - p}$, i.e., iff $\frac{q}{1 - q} = \frac{p}{1 - p}$. Since the function $x \mapsto \frac{x}{1 - x}$ is injective, we can apply its inverse to both sides and obtain $p = q$ as the unique solution (and since $g$ is concave, this is a maximum). Thus, calibration is indeed rewarded.
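A quick numerical sanity check of this conclusion (a sketch; the names are mine):

```python
from math import log

def expected_log_score(p: float, q: float) -> float:
    """Expected log score of predicting p for events that come true with frequency q."""
    return q * log(p) + (1 - q) * log(1 - p)

q = 0.7
grid = [i / 1000 for i in range(1, 1000)]
best_p = max(grid, key=lambda p: expected_log_score(p, q))
print(best_p)  # 0.7 – the expected score peaks exactly at the true frequency
```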
[12]
It suffices to consider a single prediction since the scoring function is additive across predictions.
[13]
Strict convexity says that, for all $x \neq y$ and all $\lambda \in (0, 1)$, we have
$$f(\lambda x + (1 - \lambda) y) < \lambda f(x) + (1 - \lambda) f(y).$$
Set $x = p + \epsilon$, $y = p - \epsilon$, and $\lambda = \frac{1}{2}$, then multiply the inequality by 2.
[14]
The Brier score measures the negative squared distance to the outcomes, scaled by $\frac{1}{n}$. I.e., in the notation we’ve used for logarithmic scoring, we assign the number
$$-\frac{1}{n} \sum_{i=1}^{n} (p_i - o_i)^2.$$
The two properties we’ve verified for logarithmic scoring hold for the Brier score as well. Assuming perfect calibration, the expected Brier score for a single prediction with probability $p$ is $-p \cdot (1 - p)$. The corresponding graph looks like this:
Since this function is also strictly convex, the second property is immediate.
However, unlike logarithmic scoring, the Brier score has bounded penalties. Predicting 0% for an outcome that occurs yields a score of $-1$ rather than $-\infty$.
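For completeness, a hedged sketch of the Brier score under these conventions (function names are mine):

```python
def brier_score(predictions: list[float], outcomes: list[int]) -> float:
    """Negative mean squared distance between predictions and outcomes."""
    return -sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(predictions)

def expected_brier(p: float) -> float:
    """Expected Brier score of a single perfectly calibrated prediction p."""
    return -(p * (1 - p) ** 2 + (1 - p) * p ** 2)  # simplifies to -p * (1 - p)

print(brier_score([0.8] * 5, [1, 1, 1, 1, 0]))   # -(4*0.04 + 0.64)/5 = -0.16
print(expected_brier(0.5), expected_brier(0.0))  # -0.25 (worst case), 0.0 (best case)
```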
I agree, of course, that a bad prediction can perform better than a good prediction by luck. That means if you were already sufficiently sure your prediction was good, you can continue to believe it was good after it performs badly. But your belief that the prediction was good then comes from your model of the sources of the competing predictions prior to observing the result (e.g. “PredictIt probably only predicted a higher Trump probability because Trump Trump Trump”) instead of from the result itself. The result itself still reflects badly on your prediction. Your prediction may not have been worse, but it performed worse, and that is (perhaps insufficient) Bayesian evidence that it actually was worse. If Nate Silver is claiming something like “sure, our prediction of voter % performed badly compared to PredictIt’s implicit prediction of voter %, but we already strongly believed it was good, and therefore still believe it was good, though with less confidence”, then I’m fine with that. But that wasn’t my impression.
edit:
I agree I’m making an assumption like “the difference in probability between a 6.5% average poll error and a 5.5% average poll error isn’t huge”, but I can’t conceive of any reason to expect a sudden cliff there instead of a smooth bell curve.
I think I agree with all of that.
This is all well and good, but I feel you’ve just rephrased the problem rather than resolving it. Specifically, I think you’re saying the better prediction is the one which contains more information. But how can we tell (in your framework) which model contained more information?
If I had to make my complaint more concrete, it would be to ask you to resolve which of “Story 1” and “Story 2” is more accurate? You seem to claim that we can tell which is better, but you don’t seem to actually do so. (At least based on my reading).
I think there are some arguments but I deliberately didn’t mention any to keep the post non-political. My claim in the post is only that an answer exists (and that you can’t get it from just the outcome), not that it’s easy to find. I.e., I was only trying to answer the philosophical question, not the practical question.
In which case, isn’t there a much shorter argument: “Given an uncertain event, there is some probability that the event occurs, and that probability depends on the information you have about the event”? That doesn’t seem very interesting to me.
To make this framework even less helpful (in the instance of markets) – we can’t know what information they contain. (We can perhaps know a little more in the instance of the presidential markets because we can look at the margin and state markets – but they were inconsistent with the main market and also don’t tell you what information they were using).