I have seen a problem with selection bias in calibration tests: trick questions are overrepresented. For example, in this PDF article, the authors ask subjects to provide a 90% confidence interval for the number of employees IBM has. Fewer than 90% of subjects give an interval containing the true value, which the authors attribute to overconfidence. However, IBM has almost 400,000 employees, which is atypically high (more than 4x Microsoft). The result of this study has as much to do with the question asked as with the overconfidence of the subjects.
Similarly, trivia questions are frequently (though not always) designed to have interesting/unintuitive answers, making them problematic for a calibration quiz where people are expecting straightforward questions. I don’t know that to be the case for the AcceleratingFuture quizzes, but it is an issue in general.
That really shouldn’t matter. Your confidence interval should already account for the chance that the question is a “trick question”. If fewer than 90% of subjects give confidence intervals containing the actual number of employees, they’re being overconfident by underestimating the probability that the question has an unexpected answer.
Imagine an experiment where we randomize subjects into two groups. All subjects are given a 20-question quiz that asks them to provide a confidence interval on the temperatures in various cities around the world on various dates in the past year. However, the cities and dates for group 1 are chosen at random, whereas the cities and dates for group 2 are chosen because they were record highs or lows.
This will result in two radically different estimates of overconfidence. The fact that the result of a calibration test depends heavily on the questions being asked should suggest that the methodology is problematic.
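The thought experiment above is easy to check with a quick simulation. This is a sketch under assumed numbers: I model daily temperature anomalies as roughly normal with a standard deviation of 5 degrees, and a subject whose 90% interval is correctly calibrated to that overall distribution. Group 2's questions are the most extreme values in the data set.

```python
import random

random.seed(0)

def hit_rate(questions, interval_halfwidth):
    # The subject centers a symmetric interval on the typical value (0).
    hits = sum(1 for q in questions if abs(q) <= interval_halfwidth)
    return hits / len(questions)

# A year of daily temperature anomalies (assumed model: N(0, 5)).
population = [random.gauss(0, 5) for _ in range(365)]

# A 90% interval for N(0, 5) is roughly +/- 1.645 * 5.
halfwidth = 1.645 * 5

# Group 1: cities/dates chosen at random.
group1 = random.sample(population, 20)

# Group 2: the 20 most extreme days (record highs and lows).
group2 = sorted(population, key=abs, reverse=True)[:20]

print(hit_rate(group1, halfwidth))  # should land near 0.9
print(hit_rate(group2, halfwidth))  # far below 0.9 for the same subject
```

The same, perfectly calibrated subject looks wildly overconfident on group 2's quiz, which is the point: the measured "overconfidence" is a property of the question-selection process, not of the subject.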
What this comes down to is: how do you estimate the probability that a question has an unexpected answer? See this quiz: maybe the quizzer is trying to trick you, maybe he’s trying to reverse-trick you, or maybe he just chose his questions at random. It’s a meaningless exercise because you’re being asked to estimate values from an unknown distribution. The only rational thing to do is guess at random.
People taking a calibration test should first see the answers to a sample of the data set they will be tested on.
I think the two of you are looking at different parts of the process.
“Amount of trickiness” is a random variable that is rolled once per quiz. Averaging over a sufficiently large number of quizzes will eliminate any error it causes, which makes it a contribution to variance, not systematic bias.
On the other hand, “estimate of the average trickiness of quizzes” is a single question that people can be wrong about. No amount of averaging will reduce the influence of that question on the results, so unless your reason for caring about calibration is to get that particular question right, it does cause a systematic bias when you apply the results to every other situation.
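The distinction between the two error sources can be sketched as a simulation (the specific numbers and the trick-question model are my own assumptions, not anything from the thread): per-quiz trickiness is rolled once per quiz and averages out, while a wrong estimate of *average* trickiness biases coverage no matter how many quizzes you run.

```python
import random

random.seed(1)

QUESTIONS_PER_QUIZ = 20
TRUE_MEAN_TRICK = 0.05  # assumed average fraction of trick questions

def observed_coverage(n_quizzes, assumed_trick):
    # The subject widens their interval so that, IF their estimate of
    # average trickiness is right, overall coverage comes out to 90%.
    # Trick questions are modeled as always falling outside the interval.
    base = min(1.0, 0.9 / (1 - assumed_trick))
    hits = total = 0
    for _ in range(n_quizzes):
        # "Amount of trickiness" is rolled once per quiz.
        trick = random.uniform(0, 2 * TRUE_MEAN_TRICK)
        for _ in range(QUESTIONS_PER_QUIZ):
            if random.random() < trick:
                covered = False  # trick question: answer is a surprise
            else:
                covered = random.random() < base
            hits += covered
            total += 1
    return hits / total

# Correct estimate of average trickiness: coverage converges to 0.9,
# even though individual quizzes vary in trickiness.
print(observed_coverage(10_000, assumed_trick=TRUE_MEAN_TRICK))

# Wrong estimate: the shortfall persists at any number of quizzes.
print(observed_coverage(10_000, assumed_trick=0.0))
```

The per-quiz roll only adds noise that shrinks with more quizzes; the mistaken estimate of average trickiness shifts the long-run coverage itself, which is the systematic bias described above.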