I am curious what kind of analysis you plan to run on the calibration questions. Obvious things to do:
For each user, compute the correlation between their probabilities and the 0-1 vector of right and wrong answers. Then display the correlations in some way (a histogram?).
For each question, compute the mean (or median) of the probability for the correct answers and for the wrong answers, and see how separated they are. (Both of these are sketched in code below.)
But neither of those feels like a really satisfactory measure of calibration.
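Still, as a baseline, here is a rough sketch of both ideas. The column names (user, question, prob, correct) and the toy DataFrame are my assumptions standing in for however the survey data is actually exported, not the survey's real schema.

```python
import pandas as pd
import matplotlib.pyplot as plt

def per_user_correlation(df: pd.DataFrame) -> pd.Series:
    """Correlation between each user's stated probabilities and their 0/1 outcomes."""
    return df.groupby("user").apply(lambda g: g["prob"].corr(g["correct"]))

def per_question_separation(df: pd.DataFrame) -> pd.DataFrame:
    """Mean stated probability on each question, split by right vs. wrong answers."""
    means = df.groupby(["question", "correct"])["prob"].mean().unstack("correct")
    means.columns = ["mean_prob_wrong", "mean_prob_right"]
    means["separation"] = means["mean_prob_right"] - means["mean_prob_wrong"]
    return means

# Toy data standing in for the real survey export (column names are assumed).
df = pd.DataFrame({
    "user":     [1, 1, 1, 2, 2, 2],
    "question": ["q1", "q2", "q3", "q1", "q2", "q3"],
    "prob":     [0.9, 0.6, 0.5, 0.7, 0.8, 0.99],
    "correct":  [1, 0, 1, 1, 1, 0],
})

print(per_question_separation(df))

# Histogram of per-user correlations.
per_user_correlation(df).hist(bins=20)
plt.xlabel("corr(stated probability, correct)")
plt.ylabel("number of users")
plt.show()
```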
At the very least, I suspect one of the analyses will be ‘bucketize by certainty, then plot “what % of responses in each bucket were right?”’ - something that was done last year (see the 2013 LessWrong Survey Results); a rough sketch of that analysis is below.
Last year it was broken down into “elite” and “typical” LW-er groups, which presumably would tell you whether hanging out here makes you less prone to overconfidence, or something in that general vicinity.
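A minimal sketch of the bucketing analysis, again assuming hypothetical prob/correct columns rather than the survey's actual format: bin responses by stated confidence, then compare observed accuracy in each bin to the bin's average stated probability.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def calibration_curve(df: pd.DataFrame, n_bins: int = 10) -> pd.DataFrame:
    """Fraction of correct answers within each stated-probability bucket."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bucket = pd.cut(df["prob"], bins, include_lowest=True)
    grouped = df.groupby(bucket, observed=True)
    return pd.DataFrame({
        "mean_stated_prob": grouped["prob"].mean(),
        "observed_accuracy": grouped["correct"].mean(),
        "n_responses": grouped.size(),
    })

# Toy data standing in for the real survey export (column names are assumed).
df = pd.DataFrame({
    "prob":    [0.55, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 0.5, 0.65, 0.85],
    "correct": [1,    0,   1,   1,   0,   1,    1,    0,   1,    1],
})

curve = calibration_curve(df, n_bins=5)
plt.plot(curve["mean_stated_prob"], curve["observed_accuracy"], marker="o")
plt.plot([0, 1], [0, 1], linestyle="--")  # perfect calibration reference line
plt.xlabel("stated probability")
plt.ylabel("fraction correct in bucket")
plt.show()
```

Points below the diagonal in that plot would indicate overconfidence in that bucket; splitting df by the “elite”/“typical” grouping and drawing one curve per group would reproduce last year's comparison.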