Do you have some links to calibration training? I’m curious how they handle model error (the error when your model is totally wrong).
For question 10, for example, I'm guessing that many more people would have gotten the correct answer if the question had been something like "Name the best-selling PC game, where best-selling counts units rather than gross revenue, box purchases rather than subscriptions, and excludes games bundled with other software" instead of "What is the best-selling computer game of all time?". I'm guessing most people answered WoW, Solitaire/Minesweeper, or Tetris, each of which would be the correct answer if you removed one of those restraints.
But it seems hard to guess beforehand that the question you thought you were answering wasn't the question you were actually being asked! So you'd end up distributing that model error relatively evenly over all the questions, and thus be underconfident on the questions where your model was straightforward and correct, and overconfident where the question wasn't as simple as it appeared.
I’m curious how they handle model error (the error when your model is totally wrong).
They punish it. That is, your stated credence should include both your ‘inside view’ error of “How confident is my mythology module in this answer?” and your ‘outside view’ error of “How confident am I in my mythology module?”
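One simple way to make that concrete (a sketch, not anything from the game itself): treat "my mythology module is reliable" as a binary event and mix the inside-view confidence with chance accordingly.

```python
def stated_credence(p_inside, p_module_reliable, chance):
    """Blend inside-view and outside-view confidence.

    If the module is reliable, trust its answer at p_inside;
    if it isn't, assume you do no better than chance.
    """
    return p_module_reliable * p_inside + (1 - p_module_reliable) * chance

# Hypothetical numbers: the mythology module feels 90% sure, but you
# only trust the module itself 80% of the time, on a true/false
# question (chance = 0.5):
print(f"{stated_credence(0.9, 0.8, 0.5):.2f}")  # 0.82
```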
One of the primary benefits of playing a Credence Game like this one is it gives you a sense of those outside view confidences. I am, for example, able to tell which of two American postmasters general came first at the 60% level, simply by using the heuristic of “which of these names sounds more old-timey?”, but am at the 50% level (i.e. pure chance) in determining which sports team won a game by comparing their names.
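In code, "getting a sense of those confidences" amounts to tracking each heuristic's hit rate over your answer history; the names and numbers below are illustrative, not from the game:

```python
# 1 = the heuristic picked the right answer, 0 = it didn't
history = {
    "more old-timey name came first": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
    "better-sounding team name won":  [1, 0, 0, 1, 1, 0, 0, 1, 0, 1],
}

for heuristic, results in history.items():
    hit_rate = sum(results) / len(results)
    print(f"{heuristic}: use ~{hit_rate:.0%} credence")
# -> 60% for the postmaster heuristic, 50% (pure chance) for the sports one
```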
But it seems hard to guess beforehand that the question you thought you were answering wasn’t the question that you were being asked!
This is the sort of thing you learn by answering a bunch of questions from the same person, or by having a lawyer-sense of "how many qualifications would I need to add to or remove from this sentence to be sure?".
OK, so all that makes sense and seems basically correct, but I don't see how you get from there to being able to map confidence across persons for a single question the same way you can across questions for a single person.
Adopting that terminology, I'm saying that typical Less Wrong users likely have similar understanding-the-question modules. Each module will be right most of the time and wrong some of the time, so users correctly apply the outside-view error uniformly to each of their estimates. But since the understanding-the-question module is similar across people, the actual errors aren't evenly distributed across questions, so they will be underconfident on "easy" questions and overconfident on "hard" ones, if easy and hard are determined afterwards by the percentage who get the answer correct.
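A quick simulation of that claim (all parameters hypothetical): everyone applies the same uniform outside-view discount, but the misreading risk actually concentrates in a few trick questions.

```python
import random

random.seed(0)
N = 10_000          # simulated answerers
P_INSIDE = 0.9      # accuracy when the question is read as intended
CHANCE = 0.25       # baseline for a four-option question
P_READ_OK = 0.8     # each answerer's overall trust in their question-reading

# Everyone states the same uniformly-discounted credence on every question:
stated = P_READ_OK * P_INSIDE + (1 - P_READ_OK) * CHANCE

def accuracy(p_misread):
    """Fraction correct when this question gets misread with probability p_misread."""
    hits = 0
    for _ in range(N):
        if random.random() < p_misread:
            hits += random.random() < CHANCE    # misread: reduced to chance
        else:
            hits += random.random() < P_INSIDE  # read as intended
    return hits / N

print(f"stated credence:              {stated:.2f}")          # 0.77 everywhere
print(f"easy question (5% misread):   {accuracy(0.05):.2f}")  # ~0.87 -> underconfident
print(f"trick question (90% misread): {accuracy(0.90):.2f}")  # ~0.32 -> overconfident
```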
Since the understanding-the-question module is similar for each person, though, the actual errors aren’t evenly distributed across questions, so they will underestimate on “easy” questions and overestimate on “hard” ones, if easy and hard are determined afterwards by percentage that get the answer correct.
That seems reasonable to me, yes, as an easy way for a question to be ‘hard’ is if most answerers interpret it differently from the questioner.