I think the two of you are looking at different parts of the process.
“Amount of trickiness” is a random variable that is rolled once per quiz. Averaging over a sufficiently large number of quizzes will eliminate any error it causes, which makes it a contribution to variance, not systematic bias.
On the other hand, “estimate of the average trickiness of quizzes” is a single quantity that people can be wrong about. No amount of averaging will reduce its influence on the results, so unless your reason for caring about calibration is to get that particular estimate right, it does cause a systematic bias when you apply the results to any other situation.
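A minimal simulation sketch of the distinction (the additive scoring model and all the specific numbers here are my assumptions, not anything from the quiz itself): per-quiz trickiness is redrawn each time, so its noise washes out of the average, while a wrong estimate of *mean* trickiness shifts every prediction by the same amount.

```python
import random

random.seed(0)
skill = 80.0                 # assumed true ability, in points
true_avg_trickiness = 10.0   # assumed mean points lost to tricky wording
n_quizzes = 100_000

def avg_error(estimated_avg_trickiness):
    """Mean (predicted - actual) score over many quizzes."""
    total = 0.0
    for _ in range(n_quizzes):
        # trickiness is a random variable rolled once per quiz
        trickiness = random.gauss(true_avg_trickiness, 5.0)
        actual = skill - trickiness
        # the predictor only gets to use its one fixed estimate
        predicted = skill - estimated_avg_trickiness
        total += predicted - actual
    return total / n_quizzes

# Correct estimate: per-quiz randomness averages out (variance, no bias).
print(avg_error(10.0))
# Underestimate by 5 points: a persistent ~+5 offset that no amount
# of averaging removes (systematic bias).
print(avg_error(5.0))
```

The first average hovers near zero however noisy individual quizzes are; the second stays pinned near the size of the estimation error.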