Please help me understand how someone is not miscalibrated when things they assign a high probability actually occur with a low frequency. Under what definitions are you working?
Also:
Failing to assign the correct probability given your information is a failure both of accuracy and of calibration.
Suppose you take a test with many multiple-choice questions (say, 5 choices each), and for each question I elicit from you your probability of having the right answer. Accuracy is graded by your total score on the test. Calibration is graded by your log-score on the probabilities. Our lottery enthusiast might think they’re 50% likely to have the right answer even when they don’t have any information distinguishing the answers—and because of this they will have a lower log score than someone who correctly thinks they have a 1⁄5 chance. These two people may have the same scores on the test, but they will have different scores on their ability to assign probabilities.
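To make that scoring concrete, here is a minimal sketch (mine, not from the post) of the two test-takers above, assuming both really have a 1⁄5 chance of being right on each question and assuming the per-question log score is log(p) if the chosen answer is right and log(1 − p) if it is wrong:

```python
import math

def expected_log_score(reported_p, hit_rate):
    # Expected per-question log score for the statement "my answer is right":
    # log(reported_p) when right, log(1 - reported_p) when wrong.
    # Higher (less negative) is better.
    return hit_rate * math.log(reported_p) + (1 - hit_rate) * math.log(1 - reported_p)

true_hit_rate = 1 / 5  # both people actually have a 1-in-5 chance per question

print(expected_log_score(0.5, true_hit_rate))  # about -0.693 (the lottery enthusiast's 50%)
print(expected_log_score(0.2, true_hit_rate))  # about -0.500 (the honest 1/5 report)
```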
Please help me understand how someone is not miscalibrated when things they assign a high probability actually occur with a low frequency.
Because you’re taking ‘high’ to mean ‘larger than accurate but still not >50%’, and as noted above in the post, normalization of probability means that the opposite answer will balance it. They are right and wrong at exactly the right rate. The symmetrized calibration curve will be exactly right.
But, they could be doing much better if they took that information into account.
As for the multiple-choice questions, your example is odd—“Calibration is graded by your log-score on the probabilities”—that’s not what calibration means. By that metric, someone who is perfectly calibrated but 60% accurate would lose to someone who answers perfectly but is strongly underconfident and doesn’t use probabilities more extreme than 75%.
The latter person is worse-calibrated than the first.
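If it helps, here is the same kind of back-of-the-envelope check for those two people, under the assumptions that the first reports 60% and is right 60% of the time, and the second always answers correctly but only ever reports 75%:

```python
import math

def expected_log_score(reported_p, hit_rate):
    # log(reported_p) when right, log(1 - reported_p) when wrong; higher is better.
    return hit_rate * math.log(reported_p) + (1 - hit_rate) * math.log(1 - reported_p)

# Perfectly calibrated but only 60% accurate: reports 0.6 and is right 60% of the time.
print(expected_log_score(0.60, 0.60))  # about -0.673

# Answers perfectly but never reports more than 75%: reports 0.75 and is always right.
print(expected_log_score(0.75, 1.00))  # about -0.288, so this person wins on the log score
```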
As for the multiple-choice questions, your example is odd—“Calibration is graded by your log-score on the probabilities”—that’s not what calibration means. By that metric, someone who is perfectly calibrated but 60% accurate would lose to someone who answers perfectly but is strongly underconfident and doesn’t use probabilities more extreme than 75%.
Yeah—the log probability score improves with better calibration, but it also improves with accuracy (there’s a nice information-theoretic pattern to this). I agree that this makes it a poor measure of calibration for inter-personal comparison; you’re right.
The latter person is worse-calibrated than the first.
Would you say that they’re even worse calibrated if they answered perfectly but always said that their probability of being right was 50%, or would that make them perfectly calibrated again?
50% may be ‘perfectly’ calibrated, but it’s automatically perfectly calibrated and so doesn’t really say anything.
From the way the question was put together, though, we can say that would be worse calibrated: look at their probabilities for the wrong answers. Either they put 50% on those too, in which case it’s silly to say they were right (since they’re ignoring even the exclusive structure of the question), or they gave miscalibrated probabilities on them.
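For concreteness, a tiny check of the ‘exclusive structure’ point, assuming a 5-choice question like the ones above:

```python
# Five mutually exclusive answers on one question: coherent probabilities must sum to 1.
# Putting 50% on every answer ignores that structure entirely.
assignments = [0.5] * 5
print(sum(assignments))  # 2.5
```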
Perhaps I have an explanation. 50% is special because its log (or similar) score is insensitive to whether you are right or wrong. Therefore, if you vary how good the agent’s information was (the accuracy), you cannot decrease their score if they answered 50%: it’s ‘perfect’.
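A quick way to see that insensitivity, using the same assumed log scoring rule as in the earlier sketches:

```python
import math

# At 50%, the log score is identical whether the statement turns out right or wrong,
# so making the agent's information better or worse cannot change the score.
print(math.log(0.5))      # score if right: about -0.693
print(math.log(1 - 0.5))  # score if wrong: about -0.693, exactly the same
```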
On the other hand, if you keep how good the agent’s information was fixed, and vary which probabilities the agent chooses, 50% is nothing special—it’s not any kind of special stopping place, and if you have any asymmetrical information you can get a better score by using a different probability.
When talking about calibration, you were asking “were the probability assignments falsified by the data?”, in which case 50% is special because it’s insensitive to changes in the data. But I think the much better question is “how well did the person use their information to assign probabilities?”—don’t vary how good the person’s information is, only vary how they assign probabilities. In this case, if someone gets a 70% score on a test but thinks they only got 50%, they’re poorly calibrated: they could use their information to get right answers, but they couldn’t use it to get good probabilities, and they should change what probability they assign to answers that feel the way those answers did on that test.
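And a sketch of that last example, under the same assumed scoring rule: hold the information fixed (answers that are in fact right 70% of the time) and vary only the reported probability.

```python
import math

def expected_log_score(reported_p, hit_rate=0.70):
    # Fixed information: answers are in fact right 70% of the time.
    return hit_rate * math.log(reported_p) + (1 - hit_rate) * math.log(1 - reported_p)

print(expected_log_score(0.50))  # about -0.693
print(expected_log_score(0.70))  # about -0.611

# Sweeping the reported probability shows the expected score peaks at 0.70 exactly.
best_p = max(range(1, 100), key=lambda p: expected_log_score(p / 100))
print(best_p / 100)  # 0.7
```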