50% may be ‘perfectly’ calibrated, but it’s automatically perfectly calibrated and so doesn’t really say anything.
From the way it was put together, though, we can say that it would be worse calibrated: look at their probabilities for the wrong answers; either they put 50% on those too, in which case it’s silly to say they were right (since they’re ignoring even the exclusive structure of the question), or they gave miscalibrated probabilities on them.
Perhaps I have an explanation.
50% is special because its log (or similar) score is insensitive to whether you are right or wrong. Therefore, if you vary how good the agent’s information was (their accuracy), you cannot decrease their score if they answered 50%: it’s ‘perfect’.
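A minimal numerical sketch of that insensitivity (the `log_score` helper here is just an illustration, not anything from the original discussion):

```python
import math

def log_score(p_assigned: float, correct: bool) -> float:
    """Log score for a yes/no answer: the log of the probability
    assigned to whatever actually happened."""
    return math.log(p_assigned if correct else 1.0 - p_assigned)

# At 50% the score is log(0.5) ≈ -0.693 whether the answer was right or wrong,
# so varying how good the agent's information was cannot move it.
print(log_score(0.5, correct=True), log_score(0.5, correct=False))
# At any other probability the score does depend on the outcome.
print(log_score(0.9, correct=True), log_score(0.9, correct=False))  # ≈ -0.105 vs ≈ -2.303
```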
On the other hand, if you keep how good the agent’s information was fixed, and vary which probabilities the agent chooses, 50% is nothing special—it’s not any kind of special stopping place, and if you have any asymmetrical information you can get a better score by using a different probability.
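A rough sketch of that point (assuming, purely for concreteness, an agent whose information makes them right 70% of the time): the expected log score is maximized by reporting that 70%, not by sticking at 50%.

```python
import math

def expected_log_score(p_reported: float, p_true: float) -> float:
    """Expected log score when the agent is actually right with probability
    p_true but reports probability p_reported."""
    return p_true * math.log(p_reported) + (1 - p_true) * math.log(1 - p_reported)

# Hold the information fixed (70% accuracy) and sweep the reported probability:
for p in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(p, round(expected_log_score(p, 0.7), 4))
# 0.5 -0.6931
# 0.6 -0.6325
# 0.7 -0.6109   <- best score comes from matching the real accuracy
# 0.8 -0.639
# 0.9 -0.7645
```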
When talking about calibration, you were asking “were the probability assignments falsified by the data?”, in which case 50% is special because it’s insensitive to changes in the data. But I think the much better question is “how well did the person use their information to assign probabilities?”—don’t vary how good the person’s information is, only vary how they assign probabilities. In this case, if someone gets 70% on a test but thought they only got 50%, they’re poorly calibrated: they could use their information to get right answers, but not to produce good probabilities, and they should change what probability they assign to answers that feel the way the answers on that test did.
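To put a number on that test example, here is a minimal sketch (the 100-question test, and the assumption that they assign one flat probability to every answer, are both hypothetical):

```python
import math

# Hypothetical test: 100 questions, of which the person actually gets 70 right.
n_right, n_wrong = 70, 30

def total_log_score(p: float) -> float:
    """Total log score if the same probability p is assigned to every answer."""
    return n_right * math.log(p) + n_wrong * math.log(1 - p)

print(total_log_score(0.5))  # ≈ -69.3: the "I think I only got ~50%" assignment
print(total_log_score(0.7))  # ≈ -61.1: assigning the probability that matches how the answers felt
```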