It’s not that intuitively obvious how Brier scores vary with confidence and accuracy (for example: how accurate do you need to be for high-confidence answers to be a better choice than low-confidence?), so I made this chart to help visualize it:
Here’s log-loss for comparison (note that log-loss can be infinite, so the color scale is capped at 4.0):
Claude-generated code and interactive versions (with a useful mouseover showing the values at each point for confidence, accuracy, and the Brier (or log-loss) score):
Brier score
Log-loss
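The linked interactive versions aren't reproduced here, but as a rough illustration, here's a minimal sketch of how charts like these could be generated. This is my own code (assuming numpy and matplotlib), not the Claude-generated version linked above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid: prediction confidence (0.5 to 1.0) vs actual accuracy (0.0 to 1.0).
conf = np.linspace(0.5, 1.0, 201)
acc = np.linspace(0.0, 1.0, 201)
C, A = np.meshgrid(conf, acc)

# If every event is predicted with confidence C and a fraction A resolve as
# predicted, the average Brier score mixes the two cases: correct predictions
# contribute (1 - C)^2 each, incorrect ones contribute C^2 each.
brier = A * (1 - C) ** 2 + (1 - A) * C ** 2

# Log-loss analogue; clipped because it diverges as confidence approaches 1
# on wrong answers, then capped at 4.0 for the color scale, as in the post.
eps = 1e-12
logloss = -(A * np.log(np.clip(C, eps, 1.0))
            + (1 - A) * np.log(np.clip(1 - C, eps, 1.0)))
logloss = np.minimum(logloss, 4.0)

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
panels = [(axes[0], brier, "Brier score"),
          (axes[1], logloss, "Log-loss (capped at 4.0)")]
for ax, z, title in panels:
    im = ax.pcolormesh(C, A, z, shading="auto")
    ax.set_xlabel("prediction_confidence")
    ax.set_ylabel("actual_accuracy")
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
plt.show()
```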
Interesting. Question: Why does the prediction confidence start at 0.5? And how is the “actual accuracy” calculated?
Just because predicting eg a 10% chance of X can instead be rephrased as predicting a 90% chance of not-X, making everything below 50% redundant.
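A quick check of that equivalence (a sketch with a hypothetical brier helper, not code from the post): predicting X at 10% scores exactly the same as predicting not-X at 90%, whichever way the event resolves:

```python
import math

def brier(p, outcome):
    """Brier score for one prediction: p is the stated probability, outcome is 0 or 1."""
    return (p - outcome) ** 2

# Predicting X at 10% scores the same as predicting not-X at 90%,
# whether or not X actually happens (up to float rounding).
for x_happened in (0, 1):
    assert math.isclose(brier(0.10, x_happened), brier(0.90, 1 - x_happened))
```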
It assumes that you predict every event with the same confidence (namely prediction_confidence) and then that you’re correct on actual_accuracy of those. So for example if you predict 100 questions will resolve true, each with 100% confidence, and then 75 of them actually resolve true, you’ll get a Brier score of 0.25 (ie 3⁄4 of the way up the right-hand side of the graph).

Of course typically people predict different events with different confidences—but since overall Brier score is the simple average of the Brier scores on individual events, that part’s reasonably intuitive.
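Put as a formula (my restatement of the comment above, with c for prediction_confidence and a for actual_accuracy): the average Brier score is a·(1 − c)² + (1 − a)·c². A minimal sketch verifying the 100-questions example:

```python
def expected_brier(prediction_confidence, actual_accuracy):
    # Correct predictions each score (1 - c)^2, incorrect ones score c^2;
    # the overall score is their accuracy-weighted average.
    c, a = prediction_confidence, actual_accuracy
    return a * (1 - c) ** 2 + (1 - a) * c ** 2

print(expected_brier(1.0, 0.75))  # 0.25, matching the example above
```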