Are you saying that you think I’m doing something that isn’t mathematically legitimate? I don’t think I am.
I’m taking a sum of a bunch of logarithms and then dividing that by the sum of another bunch of logarithms. In the calculations above I used base e logarithms everywhere. Perhaps you are worried that the answer 97.2% depends on the base used. But if I had changed to base x, then the change-of-base formula says that I would have to divide the logarithms by log(x) (i.e. the base e logarithm of x). Since I would be dividing both the top and the bottom of my fraction by the same amount, the answer would be unchanged.
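Here’s a quick numerical check of that (the probabilities are made up, purely to show that the base doesn’t matter):

```python
import math

# Made-up probabilities, just to illustrate base invariance: what was
# assigned to the outcomes that happened, and the monotonically
# optimised versions of the same probabilities.
assigned  = [0.7, 0.9, 0.6]
optimised = [0.8, 0.9, 0.7]

def calibration(base):
    # Sum of logs of the optimised probabilities over the sum of logs of
    # the assigned probabilities, both taken in the given base.
    top = sum(math.log(q, base) for q in optimised)
    bottom = sum(math.log(p, base) for p in assigned)
    return top / bottom

print(calibration(math.e), calibration(2), calibration(10))
# all three agree (up to floating-point rounding)
```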
Or perhaps you are worried that the final answer doesn’t actually depend on the calibration? Let’s pretend that for Scott’s 90% predictions (of which he got 94% correct) he had actually said 85%. This would make his calibration worse, so let’s see if my measure of his calibration goes down. His log score has worsened from −0.462 to −0.467. His “monotonically optimised” score remains the same (he now wishes he had said 94% instead of 85%) so it is still −0.448. Hence his calibration has decreased from −0.448/−0.462 = 97.2% to −0.448/−0.467 = 96.0%. This shows that his calibration has in fact got worse, so everything seems to be in working order.
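Here is just that bucket’s contribution in numbers (the helper name is mine; the only input taken from the example is the 94% frequency, and this is one bucket only, not the full −0.462 / −0.448 figures):

```python
import math

def bucket_log_score(p, f):
    # Mean log-score contribution of a bucket of predictions where you
    # said p and a fraction f of them came true.
    return f * math.log(p) + (1 - f) * math.log(1 - p)

f = 0.94  # the observed frequency in the 90% bucket
for p in (0.94, 0.90, 0.85):
    print(f"{p:.2f}: {bucket_log_score(p, f):.4f}")
# 0.94: -0.2270  <- the monotonically optimised choice for this bucket
# 0.90: -0.2372
# 0.85: -0.2666  <- saying 85% makes the log score worse, as claimed
```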
EDIT: What the change-of-base formula does show is that my “calibration score” can also be thought of as the logarithm of the product of the monotonically improved probabilities, taken to the base of the product of the probabilities I actually assigned to the outcomes that did happen. That seems to me to be a confusing way of looking at it, but it still seems mathematically legit.
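In code, with the same made-up probabilities as before (this is just the change-of-base identity, nothing new):

```python
import math

assigned = [0.7, 0.9, 0.6]   # made-up probabilities assigned to what happened
improved = [0.8, 0.9, 0.7]   # their monotonically improved versions

# The calibration score: sum of logs of the improved probabilities over
# the sum of logs of the assigned ones.
ratio = sum(math.log(q) for q in improved) / sum(math.log(p) for p in assigned)

# Equivalently: the log of the product of the improved probabilities,
# taken to the base of the product of the assigned probabilities.
alt = math.log(math.prod(improved), math.prod(assigned))

print(ratio, alt)  # equal (up to floating-point rounding)
```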
No and no. I know all those things that you wrote.
It’s “legit” in the sense that the operation is well defined… but it’s not doing the work you’d like it to do.
Your number is “locally” telling you which direction is towards more calibration, but it is not meaningful outside of the particular configuration of predictions. And you can already guess that direction. What you need is to quantify something that is meaningful for different sets of predictions.
Example:
If I made 3 predictions at 70% confidence, 2 failed and 1 was correct:
Mean log score: (ln(0.3)*2 + ln(0.7)) / 3 = −0.92154
If I said 33% confidence: (ln(0.67)*2 + ln(0.33)) / 3 = −0.63654
Your score is: 69%
If I made 3 predictions such that 1 failed and 1 was correct at 70%, and 1 failed at 60%:
Mean log score: (ln(0.3) + ln(0.7) + ln(0.4)) / 3 = −0.82565
If I said 50% instead of 70%, and 0% instead of 60%: (ln(0.5)*2 + ln(1)) / 3 = −0.462098
Your score is: 56%
Have you noticed that failing a prediction at 60% is clearly better than failing the same prediction at 70%?
However, your score is lower in the former (60%) case.
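In script form, for concreteness (the helper function name is mine; the numbers are as above):

```python
import math

def mean_log_score(probs_for_outcomes):
    # probs_for_outcomes: the probability assigned to what actually happened
    return sum(math.log(p) for p in probs_for_outcomes) / len(probs_for_outcomes)

# Example 1: three predictions at 70%, two failed and one correct.
actual_1    = mean_log_score([0.3, 0.3, 0.7])      # -0.92154
optimised_1 = mean_log_score([0.67, 0.67, 0.33])   # -0.63654 (saying 33% instead)

# Example 2: a failure and a success at 70%, plus a failure at 60%.
actual_2    = mean_log_score([0.3, 0.7, 0.4])      # -0.82565
optimised_2 = mean_log_score([0.5, 0.5, 1.0])      # -0.46210 (50% and 0% instead)

print(round(optimised_1 / actual_1, 2))  # 0.69
print(round(optimised_2 / actual_2, 2))  # 0.56
```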
Please forgive me if I sound patronizing. But inventing scoring rules is a tricky business, and it requires some really careful thinking.
Okay I understand what you’re saying now.
It will take me a while to think about this in more detail, but for now I’ll just note that I was demanding that we fix 50% at 50%, so 60% can’t be adjusted to 0% but only down to 50%. So in the second case the score is ln(0.5)*3/(ln(0.3)+ln(0.7)+ln(0.4)) = 84.0%, which is higher.
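As a quick sanity check (clamping everything to 50% here is just my reading of the “fix 50% at 50%” rule applied to this example):

```python
import math

# Second example: probabilities assigned to what actually happened.
actual = [0.3, 0.7, 0.4]   # fail at 70%, success at 70%, fail at 60%

# With 50% pinned at 50%, nothing stated above 50% can be adjusted below 50%,
# and the 70% predictions came true exactly half the time, so every
# prediction ends up adjusted to 50%.
adjusted = [0.5, 0.5, 0.5]

score = sum(math.log(q) for q in adjusted) / sum(math.log(p) for p in actual)
print(f"{score:.3f}")  # 0.840
```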
I think my measure should have some nice properties which justify it, but I’ll take a while to think about what they are.
EDIT: I’d say now that it might be better to take the difference rather than the ratio. Otherwise you’ll look better calibrated on difficult problems just because your score will be worse overall.
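To illustrate with made-up numbers (using total rather than mean log scores, so the two cases stay directly comparable): padding the same miscalibrated predictions with a batch of perfectly calibrated coin flips leaves the difference alone but pushes the ratio towards 100%.

```python
import math

def totals(buckets):
    # buckets: list of (stated probability, number of predictions, number correct)
    actual = optimised = 0.0
    for p, n, k in buckets:
        f = k / n  # observed frequency = the monotonically optimised probability here
        actual    += k * math.log(p) + (n - k) * math.log(1 - p)
        optimised += k * math.log(f) + (n - k) * math.log(1 - f)
    return actual, optimised

easy = [(0.9, 10, 8)]                 # overconfident on easy questions
hard = [(0.9, 10, 8), (0.5, 20, 10)]  # same mistake plus 20 well-calibrated coin flips

for name, buckets in (("easy", easy), ("hard", hard)):
    a, o = totals(buckets)
    print(name, round(o / a, 3), round(o - a, 3))
# easy 0.918 0.444
# hard 0.977 0.444  <- the ratio flatters the harder set; the difference doesn't
```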