There is the Brier score, or any other proper scoring rule. Each of these has the advantage of being zero-degree-of-freedom up to the choice of scoring rule, though it isn't information-preserving and isn't comparable across different sets of predictions. (Though neither is any analogue of a CDF.)
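As a concrete illustration, here is a minimal sketch of two common proper scoring rules for binary predictions, the Brier score and the log score; the forecasts and outcomes below are made up.

```python
import math

def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def log_score(forecasts, outcomes):
    """Mean negative log-likelihood of the outcomes under the stated probabilities."""
    return -sum(math.log(p if o else 1 - p) for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical forecasts (probability of "yes") and outcomes (1 = yes, 0 = no).
forecasts = [0.9, 0.7, 0.6, 0.2]
outcomes  = [1,   1,   0,   0]

print(brier_score(forecasts, outcomes))  # lower is better; 0 is perfect
print(log_score(forecasts, outcomes))    # lower is better; 0 is perfect
```

Both rules are proper (they are minimized in expectation by reporting one's true probability), but the resulting number reflects knowledge as well as calibration, which is the problem noted next.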
The problem is that this measures a person's knowledge of the questions as well as their calibration.
My model would be as follows. For a fixed source of questions, each person has a distribution describing how much they know about the questions: it gives the probability that a given question is one on which they should say p. Each person also has a calibration function f, such that when they should say p they instead say f(p). Then, by assigning priors over the spaces of these distributions and calibration functions and applying Bayes' rule, we get a posterior describing what we know about that person's calibration function.
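To make the Bayes step concrete, here is a minimal sketch under strong simplifying assumptions of my own: a one-parameter family of calibration functions f_a(p) = p^a / (p^a + (1-p)^a), a uniform prior over a small grid of values of a, and the stated probabilities treated as given (so the knowledge distribution drops out of the update and only the outcome likelihood matters). The data and the functional family are hypothetical.

```python
def f(p, a):
    """Assumed one-parameter family of calibration functions.
    a = 1 is perfect calibration; a > 1 means stated probabilities are
    more extreme than warranted (overconfidence); a < 1 means they are
    pulled toward 0.5 (underconfidence)."""
    return p**a / (p**a + (1 - p)**a)

def f_inverse(q, a):
    """Inverse of f(., a); for this family the inverse is f(., 1/a)."""
    return f(q, 1 / a)

# Hypothetical data: stated probabilities and 0/1 outcomes.
stated   = [0.9, 0.9, 0.8, 0.6, 0.95, 0.7]
outcomes = [1,   0,   1,   1,   1,    0]

# Discrete prior over the calibration parameter (uniform over a small grid).
grid = [0.5, 0.75, 1.0, 1.5, 2.0]
prior = {a: 1 / len(grid) for a in grid}

# Bayes' rule: the likelihood of each outcome is the probability the person
# *should* have stated, recovered by inverting the calibration function.
posterior = {}
for a, pa in prior.items():
    like = 1.0
    for q, y in zip(stated, outcomes):
        p_true = f_inverse(q, a)
        like *= p_true if y else 1 - p_true
    posterior[a] = pa * like

norm = sum(posterior.values())
posterior = {a: w / norm for a, w in posterior.items()}

print(posterior)  # posterior belief about the person's calibration parameter
```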
Then assign to each calibration function a score: the expected log score lost by a person using that calibration function instead of an ideal one (the identity f(p) = p), assuming the questions are uniformly distributed in difficulty for them. Their final calibration score is then just the expected value of that score under our posterior distribution over their calibration functions.
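Under the same assumed one-parameter family, the expected log score lost works out to the average KL divergence between Bernoulli(p) and Bernoulli(f(p)) with p uniform on (0, 1). A sketch of the final score, where the posterior weights are placeholders standing in for the output of an update like the one above:

```python
import math

def f(p, a):
    """Assumed one-parameter calibration family; a = 1 is perfect calibration."""
    return p**a / (p**a + (1 - p)**a)

def expected_log_loss(a, n=1000):
    """Expected log score lost relative to an ideally calibrated person, with
    the true probability p uniform on (0, 1).  This is the average of
    KL(Bernoulli(p) || Bernoulli(f(p))), computed by the midpoint rule."""
    total = 0.0
    for i in range(n):
        p = (i + 0.5) / n
        q = f(p, a)
        total += p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
    return total / n

# Placeholder posterior over the calibration parameter.
posterior = {0.8: 0.2, 1.0: 0.5, 1.3: 0.3}

# Final calibration score: posterior-expected log-score loss (0 is perfect).
score = sum(w * expected_log_loss(a) for a, w in posterior.items())
print(score)
```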
Agreed that the proposal combines knowledge with calibration, but your procedure doesn’t actually seem implementable.