Your link actually doesn’t provide any information about how to evaluate or estimate someone’s calibration, which is what we are talking about.
If we don’t agree about what it is, it will be very difficult to agree how to evaluate it!
It’s not quite that. I’m not happy with this use of averages.
Surely it makes sense to use averages to determine the probability of being correct for any given confidence level. If I’ve grouped together 8 predictions and labeled them “80%”, and 4 of them are correct and 4 of them are incorrect, it seems sensible to describe my correctness at my “80%” confidence level as 50%.
If one wants to measure my correctness across multiple confidence levels, then what aggregation procedure to use is unclear, which is why many papers on calibration will present the entire graph (along with individualized error bars to make clear how unlikely any particular correctness value is—getting 100% correct at the “80%” level isn’t that meaningful if I only used “80%” twice!).
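A minimal sketch of that bucketing (Python, with made-up numbers matching the “80%” example above; purely illustrative, not anyone’s actual method):

```python
from collections import defaultdict

def correctness_by_level(predictions):
    """predictions: (stated_confidence, correct) pairs, with correct as 1 or 0.
    Returns, per stated confidence level, the observed fraction correct and the count."""
    buckets = defaultdict(list)
    for confidence, correct in predictions:
        buckets[confidence].append(correct)
    return {
        level: (sum(outcomes) / len(outcomes), len(outcomes))
        for level, outcomes in buckets.items()
    }

# Eight predictions labeled "80%", four right and four wrong:
preds = [(0.8, 1)] * 4 + [(0.8, 0)] * 4
print(correctness_by_level(preds))  # {0.8: (0.5, 8)} -- i.e. 50% correct at the "80%" level
```

Reporting the count alongside each bucket is what lets you attach the error bars: 0.5 on 8 predictions means something rather different from 0.5 on 2.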
I’ll need to think more about it, but off the top of my head, I’d look at the average absolute difference between the answer (which is 0 or 1) and the confidence expressed, or maybe the square root of the sum of squares… But don’t quote me on that, I’m just thinking aloud here.
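For what it’s worth, a rough sketch of what those two candidates could look like (illustrative only; I’ve divided by the number of predictions so the scores are comparable across different prediction counts, and the squared version, before the square root, is just the Brier score that comes up below):

```python
import math

def mean_absolute_error(predictions):
    """Average |correct - confidence| over all (confidence, correct) pairs."""
    return sum(abs(correct - conf) for conf, correct in predictions) / len(predictions)

def root_mean_squared_error(predictions):
    """Square root of the mean squared difference between confidence and outcome."""
    return math.sqrt(sum((correct - conf) ** 2 for conf, correct in predictions) / len(predictions))

# Made-up data: eight "80%" predictions, half of them correct.
preds = [(0.8, 1)] * 4 + [(0.8, 0)] * 4
print(mean_absolute_error(preds))      # 0.5
print(root_mean_squared_error(preds))  # ~0.583
```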
You may find the Wikipedia page on scoring rules interesting. My impression is that it is difficult to distinguish between skill (an expert’s ability to correlate their answer with the ground truth) and calibration (an expert’s ability to correlate their reported probability with their actual correctness) with a single point estimate,* but something like the slope that Unnamed discusses here is a solid attempt.
*Also, assuming the expert knows what rule you’re using and is incentivized by a high score, you want the rule to be proper: one under which the expert maximizes their expected reward by reporting their true estimate of the probability.
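To illustrate the footnote, here is a quick check (again just a sketch) that the quadratic/Brier rule is proper, i.e. that reporting your true belief minimizes your expected penalty:

```python
def expected_brier_penalty(reported, true_p):
    """Expected squared-error penalty if the event occurs with probability true_p
    and the forecaster reports `reported` (lower is better)."""
    return true_p * (1 - reported) ** 2 + (1 - true_p) * reported ** 2

# With a true belief of 0.7, honesty minimizes the expected penalty:
for r in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(r, round(expected_brier_penalty(r, 0.7), 3))
# 0.5 -> 0.25, 0.6 -> 0.22, 0.7 -> 0.21, 0.8 -> 0.22, 0.9 -> 0.25
```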
If one wants to measure my correctness across multiple confidence levels, then what aggregation procedure to use is unclear
Yes, that is precisely the issue for me here. Essentially, you have to specify a loss function and then aggregate it. It’s unclear what kind will work best here and what that “best” even means.
You may find the Wikipedia page on scoring rules interesting.
Yes, thank you, that’s useful.
Notably, Philip Tetlock uses Brier scoring in his Expert Political Judgment project.
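For reference, the Brier score itself is just the mean squared difference between stated probability and outcome: 0 is perfect, and a forecaster who always says 50% scores 0.25 no matter what happens. A sketch on the same made-up data as above:

```python
def brier_score(predictions):
    """Mean squared difference between forecast probability and outcome (0 or 1)."""
    return sum((correct - conf) ** 2 for conf, correct in predictions) / len(predictions)

preds = [(0.8, 1)] * 4 + [(0.8, 0)] * 4
print(brier_score(preds))  # ~0.34
```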