One thing you might look at is the Brier Score, particularly the 3-component decomposition.
Score = Reliability - Resolution + Uncertainty
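For concreteness (the answer doesn't spell it out), the usual three-component decomposition for binary outcomes can be written as follows, where N is the total number of forecasts, n_k is the number of forecasts in probability category k, f_k is the forecast value of that category, ō_k is the observed frequency of the event within the category, and ō is the overall base rate:

$$\mathrm{BS} = \underbrace{\frac{1}{N}\sum_k n_k\,(f_k - \bar{o}_k)^2}_{\text{reliability}} \;-\; \underbrace{\frac{1}{N}\sum_k n_k\,(\bar{o}_k - \bar{o})^2}_{\text{resolution}} \;+\; \underbrace{\bar{o}\,(1 - \bar{o})}_{\text{uncertainty}}$$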
The nice thing about this decomposition is that it gives you more information than a single score. The uncertainty term is a sort of 'difficulty' score: it doesn't take the predictions into account at all, and it is minimized when the same outcome occurs every time.
The resolution tells you how much information each prediction carries. For an event that occurs half of the time you could predict a probability of 0.5 for everything, but if you knew more about what was going on you might be able to predict a 1 or a 0. That is a much stronger statement, so the resolution term gives you credit for it.
Reliability is then much like the scoring metric you describe. It is minimized (which is good, since it's a loss score) when all of the events you predict with 0.2 occur 20% of the time; that is, when your predicted probabilities match the observed frequencies.
All of this happens at arbitrary precision; it's just operations on real vectors, so the only limit is your floating-point precision.
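As a rough illustration (not part of the original answer), here is a minimal Python sketch of that decomposition for binary outcomes. The function name `brier_decomposition` and the toy data are my own, and each distinct forecast value is treated as its own category rather than an analyst-chosen histogram bin:

```python
import numpy as np

def brier_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty) for binary outcomes."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(forecasts)
    o_bar = outcomes.mean()                 # overall base rate

    reliability = 0.0
    resolution = 0.0
    for f_k in np.unique(forecasts):        # one category per distinct forecast value
        mask = forecasts == f_k
        n_k = mask.sum()                    # number of forecasts in this category
        o_k = outcomes[mask].mean()         # observed frequency in this category
        reliability += n_k * (f_k - o_k) ** 2
        resolution += n_k * (o_k - o_bar) ** 2

    reliability /= n
    resolution /= n
    uncertainty = o_bar * (1.0 - o_bar)
    return reliability, resolution, uncertainty

# Toy example: the three components sum back to the plain Brier score.
f = [0.97, 0.97, 0.2, 0.2, 0.2, 0.5]
o = [1, 1, 0, 0, 1, 1]
rel, res, unc = brier_decomposition(f, o)
print(rel - res + unc)                      # equals np.mean((np.array(f) - np.array(o)) ** 2)
```

With categories defined by the distinct forecast values (as above), the decomposition is exact; the binning question only arises when forecasts are grouped into wider, analyst-chosen buckets.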
Doesn’t n_k, the number of forecasts in the same probability category, indicate that this is using histogram buckets? I’m trying to say that I’m looking for methods that avoid grouping probabilities into an arbitrary number of categories chosen by the analyst. For instance, in the (possibly straw-man) histogram method that I discussed in the question, if a predictor makes a lot of 0.97 bets and no corresponding 0.93 bets, their [0.9, 1] category will be called slightly pessimistic about its predictions even if those forecasts came true exactly 97% of the time. I wouldn’t describe anything in that genre as exact, even if it is the best we have.