I think #1 is better understood as follows: you can be differently calibrated (at a given confidence level) for different kinds of question. Most of us are very well calibrated for questions about rolling dice. We may be markedly better or worse—and differently—calibrated for questions about future scientific progress, about economic consequences of political policies, about other people’s aesthetic preferences, etc.
That suggests a slightly different remedy for #1 from yours: group predictions according to subject matter. (More generally, if you’re interested in knowing how well someone is calibrated for predictions of a certain kind, we could weight all their predictions according to how closely related we think they are to that kind. Grouping corresponds to making all the weights 0 or 1.) This will also help with #2: if someone was 99% confident that Clinton would win the election, then that mistake will weigh heavily in evaluating their calibration for political and polling questions. [EDITED to add: And much less heavily in evaluating their calibration for dice-rolling or progress in theoretical physics.]
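To make the weighting idea concrete, here is a minimal sketch (in Python, with made-up prediction records and topic names) of how one might score calibration for predictions of a certain kind by weighting each prediction by its relatedness; grouping by subject matter is just the 0/1 special case.

```python
# A rough sketch, not a full calibration analysis. The prediction records
# and topic names below are invented for illustration.
predictions = [
    ("dice",     0.17, 0),   # "this die will come up six" -- it didn't
    ("dice",     0.17, 1),   # another die prediction -- it did
    ("politics", 0.99, 0),   # "Clinton will win the election"
    ("physics",  0.60, 1),   # "experiment X will confirm theory Y"
]

def weighted_calibration(preds, weights):
    """Compare weighted mean stated confidence with weighted hit rate.

    `weights` maps topics to relevance weights for the kind of prediction
    we care about; plain grouping is the case where every weight is 0 or 1.
    """
    total = sum(weights.get(t, 0.0) for t, _, _ in preds)
    if total == 0:
        return None
    mean_confidence = sum(weights.get(t, 0.0) * c for t, c, _ in preds) / total
    hit_rate = sum(weights.get(t, 0.0) * o for t, _, o in preds) / total
    return mean_confidence, hit_rate

# Calibration for political/polling questions only (grouping = 0/1 weights):
print(weighted_calibration(predictions, {"politics": 1.0}))
# Or let loosely related topics count at reduced weight:
print(weighted_calibration(predictions, {"politics": 1.0, "physics": 0.3}))
```

In practice one would presumably bin by confidence level rather than just comparing means, but the weighting step is the same either way.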
Talk of the Clinton/Trump election brings up another issue not covered by typical calibration measures. Someone who was 99% confident Clinton would win clearly made a big mistake. But so, probably, did someone who was 99% confident that Trump would win. (Maybe not; perhaps there was some reason why he actually was almost certain to win. For instance, if the conspiracy theories about Russian hacking are right and someone was 99% confident because they had inside knowledge of it, then their confidence was justified. If something like that turns out to be the case, then we should imagine this replaced with a better example.)
Sometimes, but not always, when an outcome becomes known we also get a good idea of what the “real” probability was, in some sense. For instance, the Clinton/Trump election was extremely close; unless there was some sort of foul play, that probably indicates that something around 50% would have been a good prediction. When assessing calibration, we should probably treat it as (in this case) approximately 50% of a right answer and 50% of a wrong answer.
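As a toy illustration of that scoring rule (the numbers are invented), one could tally partial credit against an after-the-fact estimate of the real probability rather than the raw 0/1 outcome:

```python
def partial_credit(preds):
    """Score predictions against an after-the-fact estimate of the "real"
    probability instead of the raw 0/1 outcome.

    preds: list of (stated confidence, estimated real probability).
    A 99%-confident Clinton prediction, judged against a real probability
    of about 0.5, counts as half a right answer and half a wrong answer.
    """
    right = sum(real for _, real in preds)
    wrong = sum(1.0 - real for _, real in preds)
    mean_confidence = sum(conf for conf, _ in preds) / len(preds)
    return {"right": right, "wrong": wrong,
            "mean confidence": mean_confidence,
            "credited hit rate": right / len(preds)}

# One 99%-confident Clinton prediction, scored against a ~50% real probability:
print(partial_credit([(0.99, 0.5)]))
```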
(Because it’s always possible that something happened for a poorly-understood reason—Russian hacking, divine intervention, whatever—perhaps we should always fudge these estimates by pushing them towards the actual outcome. So, e.g., even if the detailed election results suggest that 50% was a good estimate for Clinton/Trump, maybe we should use 0.7/0.3 or something instead of 0.5/0.5.)
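Here is one way to read that fudge numerically; the 0.4 below is just a made-up fudge factor, chosen so that a 0.5 estimate becomes 0.7 when the predicted thing actually happened:

```python
def shrink_toward_outcome(estimated_real_prob, outcome, fudge=0.4):
    """Push the after-the-fact probability estimate toward what actually
    happened, to allow for poorly-understood reasons behind the outcome.
    The fudge factor is illustrative, not a recommendation."""
    return estimated_real_prob + fudge * (outcome - estimated_real_prob)

print(shrink_toward_outcome(0.5, 1))  # 0.7 for the side that won (Trump)
print(shrink_toward_outcome(0.5, 0))  # 0.3 for the side that lost (Clinton)
```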