D_Malik’s scenario illustrates that it doesn’t make sense to partition the questions based on observed difficulty and then measure calibration, because this will induce a selection effect. The correct procedure is to partition the questions based on expected difficulty and then measure calibration.
For example, I say “heads” every time for the coin, with 80% confidence. That says to you that I think all flips are equally hard to predict prospectively. But if you were to compare my track record for heads and tails separately—that is, look at the situation retrospectively—then you would think that I was simultaneously underconfident and overconfident.
To make it clearer what it should look like normally, suppose there are two coins, red and blue. The red coin lands heads 80% of the time and the blue coin lands heads 70% of the time, and we alternate between flipping the red coin and the blue coin.
If I always answer heads, with 80% when it’s red and 70% when it’s blue, I will be as calibrated as someone who always answers heads with 75%, but will have more skill. But retrospectively, one could still claim that we are simultaneously underconfident and overconfident.
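Here is a rough simulation of that two-coin setup, just to make the numbers concrete (a sketch with made-up forecasters A and B, nothing more):

```python
import random

def simulate(n_pairs=100_000, seed=0):
    """Alternate flips of a red coin (P(heads)=0.8) and a blue coin (P(heads)=0.7).
    Forecaster A says 80% or 70% depending on the coin; forecaster B always says 75%."""
    rng = random.Random(seed)
    records = []                      # (A's stated probability, outcome) pairs
    brier_a = brier_b = 0.0
    for _ in range(n_pairs):
        for p_true, p_a in ((0.8, 0.8), (0.7, 0.7)):   # red coin, then blue coin
            heads = 1 if rng.random() < p_true else 0
            records.append((p_a, heads))
            brier_a += (p_a - heads) ** 2
            brier_b += (0.75 - heads) ** 2
    n = 2 * n_pairs
    return brier_a / n, brier_b / n, records

brier_a, brier_b, records = simulate()
# Both forecasters are calibrated (A's "80%" flips land heads ~80% of the time,
# the "70%" flips ~70%, and B's "75%" flips ~75% overall), but A's Brier score
# is lower: distinguishing the coins is skill, not calibration.
print("Brier score, A (80%/70%):", round(brier_a, 4))
print("Brier score, B (always 75%):", round(brier_b, 4))
for level in (0.8, 0.7):
    hits = [h for p, h in records if p == level]
    print(f"A said {level:.0%}: heads {sum(hits) / len(hits):.1%} of the time")
```

With these probabilities the expected Brier gap is small (about 0.185 for A vs. 0.1875 for B), but it is real: both forecasters are calibrated, and A is more skilled.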
D_Malik’s scenario illustrates that it doesn’t make sense to partition the questions based on observed difficulty and then measure calibration, because this will induce a selection effect. The correct procedure is to partition the questions based on expected difficulty and then measure calibration.
Yes, I agree with that. However it still seems to me that the example with coins is misleading and that the given example of “perfect calibration” is anything but. Let me try to explain.
Since we’re talking about calibration, let’s not use coin flips but use calibration questions.
Alice gets 100 calibration questions. To each one she provides an answer plus her confidence in her answer expressed as a percentage.
In both your example and D_Malik’s, the confidence given is the same for all questions. Let’s say it is 80%. That is an important part: Alice gives her confidence for each question as 80%. This means that for her the difficulty of each question is the same—she cannot distinguish between them on the basis of difficulty.
Let’s say the correctness of the answer is binary—it’s either correct or not. It is quite obvious that if we collect all Alice’s correct answers in one pile and all her incorrect answers in another pile, she will look to be miscalibrated, both underconfident (for the correct pile) and overconfident (for the incorrect pile).
But now we have the issue that some questions are “easy” and some are “hard”. My understanding of these terms is that the test-giver, knowing Alice, can forecast which questions she’ll be able to mostly answer correctly (those are the easy ones) and which questions she will not be able to mostly answer correctly (those are the hard ones). If this is so (and assuming the test-giver is right about Alice, which is testable by looking at the proportions of easy and hard questions in the correct and incorrect piles), then Alice fails calibration because she cannot distinguish easy and hard questions.
You are suggesting, however, that there is an alternate definition of “easy” and “hard” which is the post-factum assignment of the “easy” label to all questions in the correct pile and of the “hard” label to all questions in the incorrect pile. That makes no sense to me, as it is an obviously stupid thing to do, but it may be that the original post argued exactly against this kind of stupidity.
P.S. And, by the way, the original comment which started this subthread quoted Yvain and then D_Malik pronounced Yvain’s conclusions suspicious. But Yvain did not condition on the outcomes (correct/incorrect answers), he conditioned on confidence! It’s a perfectly valid exercise to create a subset of questions where someone declared, say, 50% confidence, and then see if the proportion of correct answers is around that 50%.
Suppose that I am given a calibration question about a racehorse and I guess “Secretariat” (since that’s the only horse I remember) and give a 30% probability (since I figure it’s a somewhat plausible answer). If it turns out that Secretariat is the correct answer, then I’ll look really underconfident.
But that’s just a sample size of one. Giving one question to one LWer is a bad method for testing whether LWers are overconfident or underconfident (or appropriately confident). So, what if we give that same question to 1000 LWers?
That actually doesn’t help much. “Secretariat” is a really obvious guess—probably lots of people who know only a little about horseracing will make the same guess, with low to middling probability, and wind up getting it right. On that question, LWers will look horrendously underconfident. The problem with this method is that, in a sense, it still has a sample size of only one, since tests of calibration are sampling both from people and from questions.
The LW survey had better survey design than that, with 10 calibration questions. But Yvain’s data analysis had exactly this problem—he analyzed the questions one-by-one, leading (unsurprisingly) to the result that LWers looked wildly underconfident on some questions and wildly overconfident on others. That is why I looked at all 10 questions in aggregate. On average (after some data cleanup) LWers gave a probability of 47.9% and got 44.0% correct. Just 3.9 percentage points of overconfidence. For LWers with 1000+ karma, the average estimate was 49.8% and they got 48.3% correct—just a 1.4 percentage point bias towards overconfidence.
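(The aggregate measure is nothing fancier than the difference between two means; a minimal sketch with made-up numbers in place of the actual survey responses:)

```python
import numpy as np

# Made-up responses standing in for the survey data:
# stated probability (in %) and whether the answer was correct.
probs   = np.array([50, 80, 30, 90, 60, 20, 70, 40, 55, 65], dtype=float)
correct = np.array([ 1,  1,  0,  1,  0,  0,  1,  0,  1,  0], dtype=float)

overconfidence = probs.mean() - 100 * correct.mean()   # positive = overconfident on average
print(f"mean stated probability: {probs.mean():.1f}%")            # 56.0%
print(f"proportion correct:      {100 * correct.mean():.1f}%")    # 50.0%
print(f"overconfidence:          {overconfidence:.1f} percentage points")   # 6.0
```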
Being well-calibrated does not only mean “not overconfident on average, and not underconfident on average”. It also means that your probability estimates track the actual frequencies across the whole range from 0 to 1: when you say “90%” it happens 90% of the time, when you say “80%” it happens 80% of the time, etc. In D_Malik’s hypothetical scenario where you always answer “80%”, we aren’t getting any data on your calibration for the rest of the range of subjective probabilities. But that scenario could be modified to show calibration across the whole range (e.g., several biased coins, with known biases). My analysis of the LW survey in the previous paragraph only addresses overconfidence on average, but I also did another analysis which looked at slopes across the range of subjective probabilities and found similar results.
That is why I looked at all 10 questions in aggregate.
Well, you did not look at calibration; you looked at overconfidence, which I don’t think is a terribly useful metric—it ignores the actual calibration (the match between the confidence and the answer) and just smushes everything into two averages.
It reminds me of an old joke about a guy who went hunting with his friend the statistician. They found a deer, the hunter aimed, fired—and missed. The bullet went six feet to the left of the deer. Amazingly, the deer ignored the shot, so the hunter aimed again, fired, and this time the bullet went six feet to the right of the deer. “You got him, you got him!” yelled the statistician...
So, no, I don’t think that overconfidence is a useful metric when we’re talking about calibration.
but I also did another analysis which looked at slopes across the range of subjective probabilities
Sorry, ordinary least-squares regression is the wrong tool to use when your response variable is binary. Your slopes are not valid. You need to use logistic regression.
Overconfidence is the main failure of calibration that people tend to show in the published research. If LWers are barely overconfident, then that is pretty interesting.
I used linear regression because perfect calibration is reflected by a linear relationship between subjective probability and correct answers, with a slope of 1.
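To make the comparison concrete, here is a rough sketch on simulated (not survey) data; statsmodels is just one convenient way to fit both models:

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-in for the survey data: stated probability p and binary correctness y,
# generated from a perfectly calibrated respondent, i.e. P(correct | p) = p.
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=5000)
y = (rng.random(5000) < p).astype(float)

# Linear regression of correctness on stated probability.
# Under perfect calibration E[y | p] = p, so the reference values are intercept 0, slope 1.
ols = sm.OLS(y, sm.add_constant(p)).fit()
print("OLS  intercept, slope:", np.round(ols.params, 3))

# Logistic regression, as suggested. It handles the binary outcome, but its coefficients
# have no comparably simple "perfect calibration" reference values, since logit(p) is not
# a linear function of p.
logit = sm.Logit(y, sm.add_constant(p)).fit(disp=0)
print("Logit intercept, slope:", np.round(logit.params, 3))
```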
If you prefer, here is a graph in the same style that Yvain used.
X-axis shows subjective probability, with responses divided into 11 bins (<5, <15, …, <95, and 95+). Y-axis shows proportion correct in each bin, blue dots show data from all LWers on all calibration questions (after data cleaning), and the line indicates perfect calibration. Dots below the line indicate overconfidence, dots above the line indicate underconfidence. Sample size for the bins ranges from 461 to 2241.
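For reference, the binning behind a graph like this is straightforward; a sketch with toy data in place of the survey responses:

```python
import numpy as np

def calibration_bins(probs, correct):
    """Bin stated probabilities (in %) into 11 bins (<5, <15, ..., <95, 95+)
    and return, for each non-empty bin, its nominal value, proportion correct, and count."""
    probs = np.asarray(probs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.arange(5, 100, 10)          # 5, 15, ..., 95
    which = np.digitize(probs, edges)      # bin index 0..10
    rows = []
    for b in range(11):
        mask = which == b
        if mask.any():
            rows.append((10 * b, correct[mask].mean(), int(mask.sum())))
    return rows

# Toy data: a perfectly calibrated respondent, so each bin's proportion correct
# should sit near the diagonal.
rng = np.random.default_rng(0)
probs = rng.integers(0, 101, size=3000)
correct = rng.random(3000) < probs / 100
for nominal, prop, n in calibration_bins(probs, correct):
    print(f"bin ~{nominal:3d}%: {prop:.2f} correct (n={n})")
```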
My understanding of these terms is that the test-giver, knowing Alice, can forecast which questions she’ll be able to mostly answer correctly (those are the easy ones) and which questions she will not be able to mostly answer correctly (those are the hard ones).
I agree that if Yvain had predicted what percentage of survey-takers would get each question correct before the survey was released, that would be useful as a measure of the questions’ difficulty and an interesting analysis. That was not done in this case.
That makes no sense to me, as it is an obviously stupid thing to do, but it may be that the original post argued exactly against this kind of stupidity.
The labeling is not obviously stupid—what questions the LW community has a high probability of getting right is a fact about the LW community, not about Yvain’s impression of the LW community. The usage of that label for analysis of calibration does suffer from the issue D_Malik raised, which is why I think Unnamed’s analysis is more insightful than Yvain’s and their critiques are valid.
However it still seems to me that the example with coins is misleading and that the given example of “perfect calibration” is anything but.
It is according to what calibration means in the context of probabilities. Like Unnamed points out, if you are unhappy that we are assigning a property of correct mappings (‘calibration’) to a narrow mapping (“80%”->80%) instead of a broad mapping (“50%”->50%, “60%”->60%, etc.), it’s valid to be skeptical that the calibration will generalize—but it doesn’t mean the assessment is uncalibrated.
It is according to what calibration means in the context of probabilities.
Your link actually doesn’t provide any information about how to evaluate or estimate someone’s calibration, which is what we are talking about.
if you are unhappy that we are assigning a property of correct mappings (‘calibration’) to a narrow mapping
It’s not quite that. I’m not happy with this use of averages. I’ll need to think more about it, but off the top of my head, I’d look at the average absolute difference between the answer (which is 0 or 1) and the confidence expressed, or maybe the square root of the sum of squares… But don’t quote me on that, I’m just thinking aloud here.
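Something like this, very roughly (the function name is mine, and I take means rather than raw sums so the numbers don’t just grow with the number of questions):

```python
import numpy as np

def rough_scores(confidence, correct):
    """confidence: stated probabilities in [0, 1]; correct: 0/1 outcomes.
    Two ad-hoc summaries of the gap between confidence and outcome."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    mean_abs = np.mean(np.abs(confidence - correct))           # average absolute difference
    root_sq  = np.sqrt(np.mean((confidence - correct) ** 2))   # root of the mean of squares
    return mean_abs, root_sq

# Toy example: always "80%" and right 80% of the time.
conf = np.full(100, 0.8)
outcome = np.array([1] * 80 + [0] * 20)
print(rough_scores(conf, outcome))   # (0.32, 0.4)
```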
Your link actually doesn’t provide any information about how to evaluate or estimate someone’s calibration, which is what we are talking about.
If we don’t agree about what it is, it will be very difficult to agree how to evaluate it!
It’s not quite that. I’m not happy with this use of averages.
Surely it makes sense to use averages to determine the probability of being correct for any given confidence level. If I’ve grouped together 8 predictions and labeled them “80%”, and 4 of them are correct and 4 of them are incorrect, it seems sensible to describe my correctness at my “80%” confidence level as 50%.
If one wants to measure my correctness across multiple confidence levels, then what aggregation procedure to use is unclear, which is why many papers on calibration will present the entire graph (along with individualized error bars to make clear how unlikely any particular correctness value is—getting 100% correct at the “80%” level isn’t that meaningful if I only used “80%” twice!).
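Here is a sketch of that per-level tabulation, reusing the 8-predictions-at-“80%” example plus a “90%” level used only twice, with Wilson intervals from statsmodels as one way to get the error bars:

```python
from collections import defaultdict
from statsmodels.stats.proportion import proportion_confint

# (stated confidence, correct?) pairs -- made-up data for illustration
predictions = [(0.8, True), (0.8, False), (0.8, True), (0.8, True),
               (0.8, False), (0.8, True), (0.8, False), (0.8, False),
               (0.9, True), (0.9, True)]

by_level = defaultdict(list)
for conf, correct in predictions:
    by_level[conf].append(correct)

for conf, outcomes in sorted(by_level.items()):
    hits, n = sum(outcomes), len(outcomes)
    lo, hi = proportion_confint(hits, n, method="wilson")   # 95% interval by default
    print(f"said {conf:.0%} ({n} predictions): {hits / n:.0%} correct, "
          f"95% CI [{lo:.0%}, {hi:.0%}]")
```

The “90%” level comes out at 100% correct, but its interval is wide enough that it tells us very little, which is the point of the error bars.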
I’ll need to think more about it, but off the top of my head, I’d look at the average absolute difference between the answer (which is 0 or 1) and the confidence expressed, or maybe the square root of the sum of squares… But don’t quote me on that, I’m just thinking aloud here.
You may find the Wikipedia page on scoring rules interesting. My impression is that it is difficult to distinguish between skill (an expert’s ability to correlate their answer with the ground truth) and calibration (an expert’s ability to correlate their reported probability with their actual correctness) with a single point estimate,* but something like the slope that Unnamed discusses here is a solid attempt.
*Assuming that the expert knows what rule you’re using and is incentivized by a high score, you also want the rule to be proper, meaning that the expert maximizes their expected reward by supplying their true estimate of the probability.
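One standard way to tease skill and calibration apart from a single Brier score is Murphy’s decomposition into reliability (calibration), resolution (how much the forecasts separate the outcomes), and uncertainty (the base-rate term); a rough sketch of the arithmetic:

```python
import numpy as np

def murphy_decomposition(forecasts, outcomes):
    """Decompose the Brier score as reliability - resolution + uncertainty,
    grouping forecasts by their distinct stated values."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(forecasts)
    base_rate = outcomes.mean()
    reliability = resolution = 0.0
    for f in np.unique(forecasts):
        mask = forecasts == f
        n_k = mask.sum()
        o_k = outcomes[mask].mean()                   # observed frequency at this forecast value
        reliability += n_k * (f - o_k) ** 2           # calibration: forecast vs. observed frequency
        resolution  += n_k * (o_k - base_rate) ** 2   # skill: how much the groups separate
    reliability /= n
    resolution /= n
    uncertainty = base_rate * (1 - base_rate)
    brier = np.mean((forecasts - outcomes) ** 2)
    return brier, reliability, resolution, uncertainty

# Toy check: a calibrated forecaster who distinguishes two groups (80% vs. 70%).
rng = np.random.default_rng(0)
f = np.repeat([0.8, 0.7], 5000)
y = (rng.random(10000) < f).astype(float)
brier, rel, res, unc = murphy_decomposition(f, y)
print(f"Brier {brier:.4f} = reliability {rel:.4f} - resolution {res:.4f} + uncertainty {unc:.4f}")
```

A well-calibrated forecaster keeps reliability near zero; a skilled one has high resolution.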
If one wants to measure my correctness across multiple confidence levels, then what aggregation procedure to use is unclear
Yes, that is precisely the issue for me here. Essentially, you have to specify a loss function and then aggregate it. It’s unclear what kind will work best here and what that “best” even means.
You may find the Wikipedia page on scoring rules interesting.
Yes, thank you, that’s useful.
Notably, Philip Tetlock in his Expert Political Judgment project uses Brier scoring.