So I thought about it, and I think you’re correct. If their errors were correlated and nonrandom (which is very natural in many situations), then we wouldn’t expect 14 of 18 in the group to get it in their range. I can imagine hypotheticals where “group calibration in one-offs” might mean something, but not here, now that I think about it.
Instead, I should’ve just stuck to pointing out how far outside their ranges many of them were, rather than the proportion of the group that got it in their range. I.e. suppose an anonymous forecaster places a 0.0000001% probability on some natural macro-scale event having outcome A instead of B on the next observation, and outcome A happens. That is strong evidence that they weren’t just “incorrect with a lot of error”; they’re probably genuinely uncalibrated too.
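To put a rough number on that intuition, here’s a minimal sketch. The 50% alternative is just an arbitrary “they had no real information” stand-in, and the figures are illustrative, not from the survey:

```python
# Hypothetical numbers, purely for illustration.
p_stated = 1e-9   # the forecaster's 0.0000001% probability on outcome A
p_no_idea = 0.5   # an arbitrary "they had no real information" alternative

# Likelihood ratio for "uncalibrated / no real information" vs. "their stated
# odds were right", given that outcome A actually happened.
bayes_factor = p_no_idea / p_stated
print(f"Observing A favors the no-idea hypothesis by a factor of {bayes_factor:.0e}")
# -> 5e+08
```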
Eyeballing the survey chart, many of those 80% CIs are so narrow that they imply the true figure was a many-sigma event. That’s really incredible, and I’d take it as evidence that they’re also uncalibrated (not just wrong in this one-off). That’s a much better way to infer they’re uncalibrated than the 14-of-18 thing.
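Here’s the kind of back-of-the-envelope check I mean, assuming the forecaster’s implicit distribution is roughly normal. The interval and realized value below are made up, not read off the actual chart:

```python
from statistics import NormalDist

def implied_sigmas(ci_low, ci_high, realized):
    """How many implied standard deviations the realized value sits from the
    center of a reported 80% CI, under a normal assumption."""
    z80 = NormalDist().inv_cdf(0.9)          # ~1.2816: half-width of an 80% CI in sigmas
    sigma = (ci_high - ci_low) / (2 * z80)   # implied standard deviation
    center = (ci_low + ci_high) / 2
    return abs(realized - center) / sigma

# Hypothetical: an 80% CI of [2.0, 3.0] when the true figure came in at 8.5.
print(f"{implied_sigmas(2.0, 3.0, 8.5):.1f} sigma")  # ~15.4 sigma
```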
Semi-related: I’m unclear about a detail in your dice example. You say a rational and calibrated interval “should be [10, 90]” for the dice rolling. Why? A calibrated-over-time 80% confidence interval on such dice could be placed anywhere (e.g. [1, 80] or [21, 100]), so long as it is 80 units wide.
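A quick simulation of what I mean, assuming the dice example amounts to a uniform draw from 1 to 100 (my reading of it, which may not match your setup): any window covering 80 of the 100 faces is calibrated over repeated rolls, wherever it’s placed.

```python
import random

random.seed(0)
rolls = [random.randint(1, 100) for _ in range(100_000)]  # uniform d100 rolls

# Differently placed intervals, each covering 80 of the 100 possible outcomes.
for lo, hi in [(1, 80), (21, 100), (11, 90)]:
    coverage = sum(lo <= r <= hi for r in rolls) / len(rolls)
    print(f"[{lo}, {hi}]: empirical coverage = {coverage:.3f}")  # all roughly 0.80
```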
Yes, but the calibrated and centered interval is uniquely [10, 90].