Perhaps you are worried that the answer 97.2% depends on the base used.
Or perhaps you are worried that the final answer doesn’t actually depend on the calibration?
No and no. I know all those things that you wrote.
That seems to me to be a confusing way of looking at it, but it still seems mathematically legit.
It’s “legit” in the sense that the operation is well defined… but it’s not doing the work you’d like it to do.
Your number “locally” tells you which direction is toward more calibration, but it is not meaningful outside of the particular configuration of predictions. And you can already guess that direction. What you need is to quantify something that is meaningful across different sets of predictions.
Example:
Suppose I made 3 predictions at 70% confidence, and 2 failed while 1 was correct:
Mean log score: (ln(0.3)*2 + ln(0.7)) / 3 = −0.92154
If I said 33% confidence: (ln(0.67)*2 + ln(0.33)) / 3 = −0.63654
Your score is: 69%
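(For concreteness, here is a minimal Python sketch of how these numbers come out; mean_log_score is just a helper name I’m making up, and the “recalibrated” probabilities are the 33%/67% figures above.)

```python
import math

def mean_log_score(preds):
    # preds is a list of (stated probability, outcome) pairs;
    # score each prediction with ln(p) if it came true, ln(1 - p) if it failed.
    return sum(math.log(p if hit else 1 - p) for p, hit in preds) / len(preds)

# Three predictions at 70%: two misses and one hit.
actual = mean_log_score([(0.7, False), (0.7, False), (0.7, True)])            # ≈ -0.92154

# Same outcomes, but stated at 33% (the observed frequency of hits).
recalibrated = mean_log_score([(0.33, False), (0.33, False), (0.33, True)])   # ≈ -0.63654

# Ratio-style score: recalibrated mean log score over the actual one.
print(round(recalibrated / actual, 2))   # 0.69
```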
Now suppose I made 3 predictions such that 1 failed and 1 was correct at 70%, and 1 failed at 60%:
Mean log score: (ln(0.3) + ln(0.7) + ln(0.4)) / 3 = −0.82565
If I said 50% instead of 70%, and 0% instead of 60%: (ln(0.5)*2 + ln(1)) / 3 = −0.462098
Your score is: 56%
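(The same computation for this second set of predictions, written out directly; the failed 60% prediction recalibrated to 0% contributes ln(1) = 0.)

```python
import math

# One miss and one hit at 70%, plus one miss at 60%.
actual = (math.log(0.3) + math.log(0.7) + math.log(0.4)) / 3        # ≈ -0.82565

# Recalibrated: 70% -> 50% (one miss, one hit), 60% -> 0% (a miss, so ln(1 - 0) = 0).
recalibrated = (math.log(0.5) + math.log(0.5) + math.log(1.0)) / 3  # ≈ -0.46210

print(round(recalibrated / actual, 2))   # 0.56
```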
Have you noticed that failing a prediction at 60% is clearly better than failing the same prediction at 70%?
However, your score is lower in the former case: 56% instead of 69%.
Please forgive me if I sound patronizing. But inventing scoring rules is a tricky business, and it requires some really careful thinking.
Okay I understand what you’re saying now.
It will take me a while to think about this in more detail, but for now I’ll just note that I was demanding that we fix 50% at 50%, so 60% can’t be adjusted to 0% but only down to 50%. So in the second case the score is ln(0.5)*3 / (ln(0.3) + ln(0.7) + ln(0.4)) = 84.0%, which is higher than in the first case.
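(As a quick check, spelled out in the same style as above, under my reading that all three predictions (the two at 70% and the one at 60%) can be pulled down only as far as 50%:)

```python
import math

# Numerator: all three predictions recalibrated to 50%, so each contributes ln(0.5).
# Denominator: the actual log scores ln(0.3), ln(0.7), ln(0.4) from the second example.
score = (math.log(0.5) * 3) / (math.log(0.3) + math.log(0.7) + math.log(0.4))
print(round(score, 3))   # 0.84, i.e. 84.0%
```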
I think my measure should have some nice properties which justify it, but I’ll take a while to think about what they are.
EDIT: I’d say now that it might be better to take the difference rather than the ratio. Otherwise you’ll look better calibrated on difficult problems just because your score will be worse overall.
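(A rough sketch of the difference-based version, assuming “difference” means the recalibrated mean log score minus the actual one, so that smaller means better calibrated and the gap is no longer diluted just because the raw scores are bad overall:)

```python
import math

def mean_log_score(preds):
    # preds: list of (stated probability, outcome) pairs.
    return sum(math.log(p if hit else 1 - p) for p, hit in preds) / len(preds)

# First example again: three predictions at 70%, two misses and one hit,
# recalibrated down to 50% (keeping the fix-50%-at-50% constraint).
actual = mean_log_score([(0.7, False), (0.7, False), (0.7, True)])
best   = mean_log_score([(0.5, False), (0.5, False), (0.5, True)])

ratio      = best / actual   # the ratio-style score discussed above
difference = best - actual   # the difference-style alternative (>= 0, smaller is better)

print(round(ratio, 3), round(difference, 3))
```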