Attempt #1: note that, for any probability p, you can compute “number of predictions you made with probability less than p that came true”. If you’re perfectly-calibrated, then this should be a random variable with:
mean = sum(q for q in prediction_probs if q<p)
variance = sum(q*(1-q) for q in prediction_probs if q<p)
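(Why those are the mean and variance: if you’re perfectly calibrated, each prediction you stated with probability q comes true independently with probability q, i.e. it’s a Bernoulli(q) variable, and summing those Bernoullis over all q<p gives exactly the two sums above. A minimal sketch of computing them as a function of p; the helper name is made up.)

import numpy as np

def calibrated_mean_and_var(prediction_probs, p):
    """Mean and variance of "number of predictions with stated probability < p
    that came true", under the hypothesis of perfect calibration
    (each such prediction is an independent Bernoulli(q))."""
    qs = np.array([q for q in prediction_probs if q < p])
    return qs.sum(), (qs * (1 - qs)).sum()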
Let’s see what this looks like if we plot it as a function of p. Let’s consider three people:
one perfectly-calibrated (green)
one systematically overconfident (red) (i.e. when they say “1%” or “99%” the true probability is more like 2% or 98%)
one systematically underconfident (blue) (i.e. when they say “10%” or “90%” the true probability is more like 5% or 95%).
Let’s have each person make 1000 predictions with probabilities uniformly distributed in [0,1]; and then sample outcomes for each set of predictions and plot out their num-true-predictions-below functions.
(The gray lines show the mean and first 3 stdev intervals for a perfectly calibrated predictor.)
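Here’s a minimal sketch of one way to set this simulation up. The logit-scaling model of over/underconfidence (and the particular “sharpness” values) is just an assumption that roughly reproduces the examples above, and all the helper names are made up:

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import logit, expit

rng = np.random.default_rng(0)

def sample_outcomes(stated, sharpness):
    """Sample outcomes whose *true* probability is expit(logit(stated) * sharpness):
    sharpness=1 is perfect calibration, sharpness<1 is overconfident (reality is
    less extreme than claimed), sharpness>1 is underconfident (more extreme)."""
    return rng.random(len(stated)) < expit(logit(stated) * sharpness)

def num_true_below(stated, outcomes, p_grid):
    """For each p: how many predictions with stated probability < p came true."""
    return np.array([outcomes[stated < p].sum() for p in p_grid])

stated = rng.random(1000)        # 1000 predictions, probabilities uniform in [0, 1]
p_grid = np.linspace(0, 1, 201)

# Gray reference lines: mean and +/-1,2,3 stdev for a perfectly calibrated predictor.
mean = np.array([stated[stated < p].sum() for p in p_grid])
std = np.sqrt(np.array([(stated[stated < p] * (1 - stated[stated < p])).sum() for p in p_grid]))
plt.plot(p_grid, mean, color="gray")
for k in (1, 2, 3):
    plt.plot(p_grid, mean + k * std, color="gray", linewidth=0.5)
    plt.plot(p_grid, mean - k * std, color="gray", linewidth=0.5)

# The three predictors. The sharpness values are guesses chosen to roughly match
# the "1% is really 2%" and "10% is really 5%" examples above.
for color, sharpness in [("green", 1.0), ("red", 0.85), ("blue", 1.3)]:
    outcomes = sample_outcomes(stated, sharpness)
    plt.plot(p_grid, num_true_below(stated, outcomes, p_grid), color=color)

plt.xlabel("p"); plt.ylabel("# predictions with stated prob < p that came true")
plt.show()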
Hrrm. The y-axis range is too big to see the variation. Let’s subtract off the mean.
And to get a feeling for how else this plot could have looked, let’s run 100 more simulations for each of the three people:
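(In code, that’s roughly the loop below, reusing the made-up helpers and arrays from the simulation sketch above and plotting each run’s curve minus the calibrated mean.)

# Reuses sample_outcomes, num_true_below, stated, p_grid, mean, and plt from the sketch above.
for _ in range(100):
    for color, sharpness in [("green", 1.0), ("red", 0.85), ("blue", 1.3)]:
        outcomes = sample_outcomes(stated, sharpness)
        curve = num_true_below(stated, outcomes, p_grid) - mean  # subtract the calibrated mean
        plt.plot(p_grid, curve, color=color, alpha=0.1, linewidth=0.5)
plt.xlabel("p"); plt.ylabel("# true below p, minus calibrated mean")
plt.show()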
Okay, this is pretty good!
The overconfident (red) person tends to see way too many 1%-20% predictions come true, as evidenced by the red lines quickly rising past the +3stdev line in that range.
The underconfident (blue) person sees way too few 10%-40% predictions come true, as evidenced by the blue lines falling past the −3stdev line in that range.
The perfect (green) person stays within 1-2 stdev of the mean.
But it’s not perfect: everything’s too squished together on the left to see what’s happening—a predictor could be really screwing up their very-low-probability predictions and this graph would hide it. Possibly related to that squishing, I feel like the plot should be right-left symmetric, to reflect the symmetries of the predictors’ biases. But it’s not.
Attempt #2: the same thing, except instead of plotting
sum((1 if came_true else 0) for (q, came_true) in zip(prediction_probs, outcomes) if q<p)
we plot
sum(-log(q if came_true else 1-q) for (q, came_true) in zip(prediction_probs, outcomes) if q<p)
i.e. we measure the total “surprisal” for all your predictions with probability under p. (I’m very fond of surprisal; it has some very appealing information-theory-esque properties.)
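(Concretely: if you stated probability q and the thing happened, the probability you assigned to the correct outcome is q; if it didn’t happen, it’s 1−q. A sketch, using the same made-up array names as the simulation sketch above; it slots in where num_true_below was.)

import numpy as np

def surprisal_below(stated, outcomes, p_grid):
    """For each p: total surprisal, i.e. -log of the probability assigned to the
    outcome that actually happened, over predictions with stated probability < p."""
    s = np.where(outcomes, -np.log(stated), -np.log(1 - stated))
    return np.array([s[stated < p].sum() for p in p_grid])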
On the bright side, this plot has less overlap between the three predictors’ typical sets of lines. And the red curves look… more symmetrical, kinda, like an odd function, if you squint. Same for the blue curves.
On the dark side, everything is still too squished together on the left. (I think this is a problem inherent to any “sum(… for q in prediction_probs if q<p)” function. I tried normalizing everything in terms of stdevs, but it ruined the symmetry and made everything kinda crazy on the left-hand side.)
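(Concretely, the normalization I mean is along these lines: under perfect calibration, a prediction stated at probability q has expected surprisal −q·log(q) − (1−q)·log(1−q), the binary entropy, and surprisal variance q·(1−q)·logit(q)²; so subtract the cumulative expected surprisal and divide by the cumulative stdev. A sketch, again with made-up names:)

import numpy as np
from scipy.special import logit

def normalized_surprisal_below(stated, outcomes, p_grid):
    """(cumulative surprisal - its calibrated mean) / its calibrated stdev, vs. p."""
    s = np.where(outcomes, -np.log(stated), -np.log(1 - stated))      # actual surprisal
    h = -stated * np.log(stated) - (1 - stated) * np.log(1 - stated)  # expected surprisal H(q)
    v = stated * (1 - stated) * logit(stated) ** 2                    # variance of surprisal
    curve = []
    for p in p_grid:
        m = stated < p
        sd = np.sqrt(v[m].sum())
        curve.append((s[m].sum() - h[m].sum()) / sd if sd > 0 else 0.0)
    return np.array(curve)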