Suppose that I am given a calibration question about a racehorse and I guess “Secretariat” (since that’s the only horse I remember) and give a 30% probability (since I figure it’s a somewhat plausible answer). If it turns out that Secretariat is the correct answer, then I’ll look really underconfident.
But that’s just a sample size of one. Giving one question to one LWer is a bad method for testing whether LWers are overconfident or underconfident (or appropriately confident). So, what if we give that same question to 1000 LWers?
That actually doesn’t help much. “Secretariat” is a really obvious guess—probably lots of people who know only a little about horseracing will make the same guess, with low to middling probability, and wind up getting it right. On that question, LWers will look horrendously underconfident. The problem with this method is that, in a sense, it still has a sample size of only one, since tests of calibration are sampling both from people and from questions.
The LW survey had better survey design than that, with 10 calibration questions. But Yvain’s data analysis had exactly this problem—he analyzed the questions one-by-one, leading (unsurprisingly) to the result that LWers looked wildly underconfident on some questions and wildly overconfident on others. That is why I looked at all 10 questions in aggregate. On average (after some data cleanup) LWers gave a probability of 47.9% and got 44.0% correct. Just 3.9 percentage points of overconfidence. For LWers with 1000+ karma, the average estimate was 49.8% and they got 48.3% correct—just a 1.5 percentage point bias towards overconfidence.
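To make the arithmetic concrete, here is a minimal sketch of that aggregate check in Python; the numbers below are made up for illustration, not the actual survey responses.

```python
import numpy as np

# One pooled list of answers: each entry is a stated probability (in [0, 1])
# and a 0/1 flag for whether the answer was correct. Illustrative data only.
stated = np.array([0.30, 0.80, 0.95, 0.50, 0.10, 0.70])
correct = np.array([1, 1, 1, 0, 0, 1])

mean_confidence = stated.mean()   # average stated probability
mean_accuracy = correct.mean()    # proportion of correct answers
overconfidence = mean_confidence - mean_accuracy  # > 0 means overconfident on average

print(f"mean confidence: {mean_confidence:.3f}")
print(f"mean accuracy:   {mean_accuracy:.3f}")
print(f"overconfidence:  {overconfidence:+.3f}")
```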
Being well-calibrated does not just mean “not overconfident on average, and not underconfident on average”. It also means that your probability estimates track the actual frequencies across the whole range from 0 to 1: when you say “90%” it happens 90% of the time, when you say “80%” it happens 80% of the time, and so on. In D_Malik’s hypothetical scenario where you always answer “80%”, we aren’t getting any data on your calibration for the rest of the range of subjective probabilities. But that scenario could be modified to test calibration across the whole range (e.g., with several biased coins whose biases are known), as in the sketch below. My analysis of the LW survey in the previous paragraph also only addresses overconfidence on average, but I also did another analysis which looked at slopes across the range of subjective probabilities and found similar results.
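A toy version of that modified scenario might look like this; the biases and flip counts are made up, and a respondent who simply reports each coin’s known bias should come out calibrated at every probability level.

```python
import numpy as np

rng = np.random.default_rng(0)
biases = [0.1, 0.3, 0.5, 0.7, 0.9]   # known per-coin probabilities of heads
flips_per_coin = 1000

# A respondent who reports each coin's known bias as their subjective
# probability is perfectly calibrated: at every stated level, the observed
# frequency of heads should be close to the stated number.
for p in biases:
    heads = int((rng.uniform(size=flips_per_coin) < p).sum())
    print(f"stated {p:.0%}   observed {heads / flips_per_coin:.1%}")
```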
That is why I looked at all 10 questions in aggregate.
Well, you did not look at calibration; you looked at overconfidence, which I don’t think is a terribly useful metric: it ignores the actual calibration (the match between the stated confidence and the correctness of the answer) and just smushes everything into two averages.
It reminds me of an old joke about a guy who went hunting with his friend the statistician. They found a deer, the hunter aimed, fired—and missed. The bullet went six feet to the left of the deer. Amazingly, the deer ignored the shot, so the hunter aimed again, fired, and this time the bullet went six feet to the right of the deer. “You got him, you got him!” yelled the statistician...
So, no, I don’t think that overconfidence is a useful metric when we’re talking about calibration.
but I also did another analysis which looked at slopes across the range of subjective probabilities
Sorry, ordinary least-squares regression is the wrong tool to use when your response variable is binary. Your slopes are not valid. You need to use logistic regression.
Overconfidence is the main calibration failure that people tend to show in the published research. If LWers are barely overconfident, then that is pretty interesting.
I used linear regression because perfect calibration is reflected by a linear relationship between subjective probability and correct answers, with a slope of 1.
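To make the comparison concrete, here is a rough sketch of both fits on simulated data, using statsmodels; the data is generated to be perfectly calibrated by construction and is not the survey responses.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
stated = rng.uniform(0.05, 0.95, size=500)              # stated probabilities
correct = (rng.uniform(size=500) < stated).astype(int)  # perfectly calibrated by construction

X = sm.add_constant(stated)                   # intercept + stated probability

ols = sm.OLS(correct, X).fit()                # linear probability model
logit = sm.Logit(correct, X).fit(disp=False)  # logistic regression

print("OLS   (intercept, slope):", ols.params)    # slope near 1 under perfect calibration
print("Logit (intercept, slope):", logit.params)  # slope is on the log-odds scale
```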
If you prefer, here is a graph in the same style that Yvain used.
The x-axis shows subjective probability, with responses divided into 11 bins (<5%, <15%, …, <95%, and 95%+). The y-axis shows the proportion correct in each bin. Blue dots show data from all LWers on all calibration questions (after data cleaning), and the diagonal line indicates perfect calibration. Dots below the line indicate overconfidence; dots above the line indicate underconfidence. Bin sample sizes range from 461 to 2241.
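For anyone who wants to reproduce this style of plot, here is a rough sketch of the binning recipe on simulated data (not the survey responses), assuming matplotlib is available.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
stated = rng.uniform(0, 1, size=5000)                    # stated probabilities (simulated)
correct = (rng.uniform(size=5000) < stated).astype(int)  # perfectly calibrated by construction

# Bin edges matching the description above: <5, <15, ..., <95, and 95+ (in %)
edges = np.array([0, 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95, 1.0])
bin_ids = np.digitize(stated, edges[1:-1])   # values 0..10, i.e. 11 bins

centers, accuracy = [], []
for b in range(len(edges) - 1):
    mask = bin_ids == b
    if mask.any():
        centers.append(stated[mask].mean())    # average stated probability in the bin
        accuracy.append(correct[mask].mean())  # proportion correct in the bin

plt.scatter(centers, accuracy, color="blue", label="binned accuracy")
plt.plot([0, 1], [0, 1], color="black", label="perfect calibration")
plt.xlabel("subjective probability")
plt.ylabel("proportion correct")
plt.legend()
plt.show()
```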