Do you have some links to calibration training? I’m curious how they handle model error (the error when your model is totally wrong).
For question 10, for example, I’m guessing that many more people would have gotten the correct answer if the question had been something like “Name the best-selling PC game, where best-selling counts only units rather than gross revenue, counts only boxed purchases and not subscriptions, and does not count games bundled with other software” instead of “What is the best-selling computer game of all time?” I’m guessing most people answered WoW, Solitaire/Minesweeper, or Tetris, each of which would be the correct answer if you removed one of those constraints.
But it seems hard to guess beforehand that the question you thought you were answering wasn’t the question you were actually being asked! So you’d end up distributing that model error relatively evenly over all the questions, and you’d end up underconfident on the questions where your model was straightforward and correct, and overconfident where the question wasn’t as simple as it appeared.
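To make the “spread evenly” part concrete, here’s a toy calculation (all of the numbers are invented for the sketch): if you assign some fixed probability to having misread the question, the blended confidence you report comes out the same on every question, whether or not your reading was actually right.

```python
# Toy sketch: spreading "did I even understand the question?" error
# evenly over every answer. All numbers are made up for illustration.
p_model_right = 0.90          # assumed chance the question means what you think it means
p_right_given_model = 0.95    # confidence in your answer *if* your reading is correct
p_right_given_misread = 0.10  # rough chance you luck out anyway if you misread it

reported = (p_model_right * p_right_given_model
            + (1 - p_model_right) * p_right_given_misread)
print(f"reported confidence: {reported:.2f}")  # same number for every question, easy or hard
```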
OK, so all that makes sense and seems basically correct, but I don’t see how you get from there to being able to map confidence for persons across a question the same way you can for questions across a person.
Adopting that terminology, I’m saying that a typical Less Wrong user likely has a similar understanding-the-question module. This module will be right most of the time and wrong some of the time, so they correctly apply the outside-view error afterwards to each of their estimates. Since the understanding-the-question module is similar across people, though, the actual errors aren’t evenly distributed across questions, so they will be underconfident on “easy” questions and overconfident on “hard” ones, if easy and hard are determined afterwards by the percentage of people who got the answer correct.
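Here’s a minimal simulation sketch of that claim (the parameters and the “tricky question” structure are my own assumptions, not anything from the survey): everyone reports the same blended confidence on every question, but because misreadings concentrate on the same few questions, the group comes out underconfident on the easy questions and overconfident on the hard ones when you score by per-question accuracy.

```python
import random

random.seed(0)

N_PEOPLE, N_QUESTIONS = 1000, 20
P_RIGHT_IF_UNDERSTOOD = 0.95   # accuracy when the question is read as intended
P_RIGHT_IF_MISREAD = 0.10      # accuracy when the "obvious" reading is wrong
# Everyone reports the same blended confidence, allowing 10% for misreading:
REPORTED = 0.9 * P_RIGHT_IF_UNDERSTOOD + 0.1 * P_RIGHT_IF_MISREAD

# Assumption: misreadings are shared, not independent -- a couple of questions
# are "tricky" and nearly everyone misreads them; the rest are read correctly.
tricky = set(random.sample(range(N_QUESTIONS), 2))

for q in range(N_QUESTIONS):
    p_misread = 0.9 if q in tricky else 0.02
    correct = sum(
        random.random() < (P_RIGHT_IF_MISREAD if random.random() < p_misread
                           else P_RIGHT_IF_UNDERSTOOD)
        for _ in range(N_PEOPLE)
    )
    label = "hard" if q in tricky else "easy"
    print(f"Q{q:2d} ({label}): stated {REPORTED:.2f}, actual {correct / N_PEOPLE:.2f}")
```

The shared `tricky` set is the whole point: if misreadings were independent across people, the error would wash out within each question and calibration would look roughly fine everywhere; because they’re correlated, the easy questions show accuracy above the stated confidence and the hard ones show accuracy far below it.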