Imagine an experiment where we randomize subjects into two groups. All subjects are given a 20-question quiz asking them to provide confidence intervals for the temperature in various cities around the world on various dates in the past year. However, the cities and dates for group 1 are chosen at random, whereas the cities and dates for group 2 are chosen because they were record highs or lows.
This will result in two radically different estimates of overconfidence. The fact that the result of a calibration test depends so heavily on which questions are asked should suggest that the methodology is problematic.
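A toy simulation makes the size of the effect concrete. Every number here is invented (a 5°C guessing error on random city/dates, record days modeled as an extra 12°C shift toward the extreme, and a subject whose 90% intervals are sized for the random case), so this is only a sketch of the mechanism, not data:

```python
import numpy as np

# Toy simulation of the two quizzes. All numbers are invented: assume a
# subject's guess for a random city/date is off by N(0, 5°C), and they
# give 90% intervals sized for exactly that error (±1.645 * 5°C).
rng = np.random.default_rng(0)
n_questions = 20_000                 # many questions, to get stable hit rates
half_width = 1.645 * 5.0             # the subject's 90% interval half-width

# Group 1: random city/dates, so errors really are N(0, 5).
errors_random = rng.normal(0.0, 5.0, n_questions)

# Group 2: record highs/lows, modeled (arbitrarily) as the same error
# plus a ±12°C shift toward the extreme.
shift = 12.0 * rng.choice([-1.0, 1.0], n_questions)
errors_record = rng.normal(0.0, 5.0, n_questions) + shift

print(f"group 1 coverage: {np.mean(np.abs(errors_random) <= half_width):.0%}")  # ~90%: looks well calibrated
print(f"group 2 coverage: {np.mean(np.abs(errors_record) <= half_width):.0%}")  # ~20-25%: looks wildly overconfident
```

The same subject, with the same intervals, scores as well calibrated on one quiz and badly overconfident on the other; only the question selection changed.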
What this comes down to is: how do you estimate the probability that a question has an unexpected answer? Take this quiz: maybe the quizzer is trying to trick you, maybe he's trying to reverse-trick you, or maybe he just chose his questions at random. It's a meaningless exercise, because you're being asked to estimate values from an unknown distribution. The only rational thing to do is guess at random.
People taking a calibration test should first see the answers to a sample of the data set they will be tested on.
I think the two of you are looking at different parts of the process.
“Amount of trickiness” is a random variable that is rolled once per quiz. Averaging over a sufficiently large number of quizzes will eliminate any error it causes, which makes it a contribution to variance, not systematic bias.
OTOH, “estimate of the average trickiness of quizzes” is a single question that people can be wrong about. No amount of averaging will reduce the influence of that one question on the results, so unless your reason for caring about calibration is to get that particular question right, it introduces a systematic bias when you apply the results to any other situation.
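A small simulation may make the variance-vs-bias distinction concrete. The setup is entirely assumed (the trickiness range, the 5°C scale, and the interval rule are all invented, not taken from the thread): each quiz rolls its own trickiness, while the subject sizes their intervals from their estimate of the *average* trickiness.

```python
import numpy as np

# Sketch of the variance-vs-bias point, with invented numbers: each quiz
# draws a trickiness t once (modeled as an answer error sd of 5°C * t),
# and the subject sizes nominal-90% intervals from their own estimate of
# the *average* trickiness.
rng = np.random.default_rng(1)
n_quizzes, n_questions = 20_000, 20
TRUE_MEAN_TRICKINESS = 1.5

def hit_rates(estimated_mean_trickiness: float) -> np.ndarray:
    """Per-quiz hit rate of the subject's nominal-90% intervals."""
    half_width = 1.645 * 5.0 * estimated_mean_trickiness
    trickiness = rng.uniform(1.2, 1.8, n_quizzes)     # rolled once per quiz
    errors = rng.normal(0.0, 1.0, (n_quizzes, n_questions)) * (5.0 * trickiness[:, None])
    return np.mean(np.abs(errors) <= half_width, axis=1)

good = hit_rates(TRUE_MEAN_TRICKINESS)  # estimate of average trickiness is right
bad = hit_rates(1.0)                    # estimate of average trickiness is too low

print(f"right estimate: mean hit rate {good.mean():.0%}, per-quiz sd {good.std():.0%}")
print(f"wrong estimate: mean hit rate {bad.mean():.0%}, per-quiz sd {bad.std():.0%}")
# Per-quiz trickiness makes individual quizzes noisy (the sd), but the mean
# converges near the nominal 90% when the average-trickiness estimate is
# right, and stays stuck well below it when that estimate is wrong.
```

In this sketch, the quiz-to-quiz noise washes out over many quizzes, but a wrong estimate of average trickiness leaves a gap that no amount of averaging removes.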