I think you’re approaching this in the wrong frame of reference.
No one is trying to discover new mathematical truths here. Constructing a particular metric (for evaluating calibration) is akin to applied engineering: you need to design something fit for a specific purpose while making a set of trade-offs in the process. You are not going to tell some guy in Taiwan designing a new motherboard that he’s silly and should just go read the academic literature and do what it tells him to do, are you?
I endorse this (while remarking that both Lumifer and I have—independently, so far as I know—suggested in this discussion that a better approach may be simply to turn the observed prediction results into some sort of smoothed/interpolated curve and plot that rather than the usual bar chart).
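For concreteness, here is roughly what that smoothed-curve plot could look like in code. This is only a minimal sketch, not a recommendation of any particular smoother: the Gaussian kernel, the bandwidth, and the simulated prediction data are all placeholder choices of mine.

```python
# A minimal sketch: smooth binary prediction outcomes into a calibration
# curve instead of binning them into a bar chart. The data here is made up.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
stated = rng.uniform(0.5, 1.0, 300)           # confidence the predictor stated
outcome = rng.random(300) < stated ** 1.3     # simulated, slightly overconfident results

grid = np.linspace(0.5, 1.0, 101)
bandwidth = 0.05                              # smoothing width (a free choice)

# Nadaraya-Watson kernel smoother: weighted average of outcomes near each grid point.
weights = np.exp(-0.5 * ((grid[:, None] - stated[None, :]) / bandwidth) ** 2)
curve = (weights * outcome).sum(axis=1) / weights.sum(axis=1)

plt.plot(grid, curve, label="smoothed observed frequency")
plt.plot(grid, grid, "--", label="perfect calibration")
plt.xlabel("stated confidence")
plt.ylabel("observed frequency")
plt.legend()
plt.show()
```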
Let me make a more concrete suggestion.
Step 0 (needs to be done only once but, so far as I know, never has been): Get a number of experimental subjects with highly varied personalities, intelligence, statistical sophistication, etc. Get them to make a lot of predictions with fine-grained confidence levels. Use this to estimate how much calibration error actually varies with confidence level; this effectively gives you a prior distribution on calibration functions.
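To make Step 0’s output concrete, here is one way the collected data could be turned into a prior. The data layout (per-subject observed frequencies at a handful of confidence levels) and the per-level Gaussian form are assumptions of mine, and the numbers are invented.

```python
# A rough sketch of Step 0, assuming each subject's results have already been
# binned by stated confidence level. All numbers below are invented.
import numpy as np

levels = np.array([0.5, 0.6, 0.7, 0.8, 0.9, 0.95])

# observed_freq[i, j] = fraction of subject i's level-j predictions that came true.
observed_freq = np.array([
    [0.52, 0.58, 0.66, 0.71, 0.80, 0.85],
    [0.50, 0.61, 0.65, 0.74, 0.83, 0.90],
    [0.55, 0.57, 0.63, 0.70, 0.78, 0.84],
    [0.48, 0.62, 0.69, 0.76, 0.86, 0.92],
])

calibration_error = observed_freq - levels          # per subject, per level
prior_mean = calibration_error.mean(axis=0)         # typical miscalibration at each level
prior_std = calibration_error.std(axis=0, ddof=1)   # how much it varies across people

for lvl, m, s in zip(levels, prior_mean, prior_std):
    print(f"p={lvl:.2f}: calibration error ~ N({m:+.3f}, {s:.3f}^2)")
```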
Step 1 (given actual calibration data): You’re now trying to estimate a single calibration function f. Each prediction-result has a corresponding likelihood: if something happened that you gave probability p to, the likelihood is simply f(p); if not, it is 1-f(p). So you’re trying to maximize the sum of log f(p) over successful predictions plus the sum of log [1-f(p)] over unsuccessful ones, plus the log-prior from Step 0: find the posterior-maximizing calibration function. (You could, e.g., pick some space of functions large enough to contain good approximations to all plausible calibration functions and optimize over a parameterization of that space.) You can figure out how confident you should be about the calibration function by sampling from the posterior distribution and looking at the resulting distribution of values at any given point.

If what you have is lots of prediction results at each of some number of confidence levels, then a normal approximation applies and you’re basically doing Gaussian process regression or kriging, which quite cheaply gives you not only a smooth curve but error estimates everywhere; in that case you don’t need an explicit representation of the space of (approximations to) permissible calibration functions.
[EDITED: I wrote 1-log where I meant log 1- and have now fixed this.]
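And a minimal sketch of Step 1. The logistic-in-log-odds parameterization of f and the way the Step 0 prior enters (independent Gaussians on the calibration error at a few reference levels) are illustrative choices on my part, not the only way to do it; the data is invented.

```python
# A minimal sketch of Step 1: MAP estimate of a calibration function f.
# The parameterization and the prior terms are illustrative choices.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit

# (stated probability, did it happen?) pairs -- invented data.
p = np.array([0.6, 0.7, 0.7, 0.8, 0.9, 0.9, 0.95, 0.95, 0.95, 0.99])
hit = np.array([1, 0, 1, 1, 1, 0, 1, 1, 0, 1], dtype=bool)

def f(params, p):
    """Calibration function: a, b shift and scale the log-odds of the stated probability."""
    a, b = params
    return expit(a + b * logit(p))

# Prior from "Step 0": at these reference levels, calibration error ~ N(mean, std).
ref_levels = np.array([0.6, 0.8, 0.95])
prior_mean = np.array([0.0, -0.02, -0.05])   # e.g. mild overconfidence at the top end
prior_std = np.array([0.05, 0.05, 0.05])

def neg_log_posterior(params):
    fp = np.clip(f(params, p), 1e-9, 1 - 1e-9)
    # log-likelihood: log f(p) for successes, log [1-f(p)] for failures.
    log_lik = np.sum(np.log(fp[hit])) + np.sum(np.log(1 - fp[~hit]))
    # log-prior: Gaussian penalties on the calibration error at the reference levels.
    err = f(params, ref_levels) - ref_levels
    log_prior = -0.5 * np.sum(((err - prior_mean) / prior_std) ** 2)
    return -(log_lik + log_prior)

result = minimize(neg_log_posterior, x0=np.array([0.0, 1.0]))
a, b = result.x
for q in [0.6, 0.8, 0.9, 0.95]:
    print(f"stated {q:.2f} -> estimated actual {f((a, b), q):.3f}")
```

Sampling from the posterior instead of just maximizing it (e.g. with a simple Metropolis sampler over a and b) would give the error bars mentioned above; and if you have many results at each of a few confidence levels, an off-the-shelf Gaussian-process regression should work as a drop-in alternative.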