If they were perfectly calibrated on this one-off prediction, about 14 should’ve had the actual outcome fall in their 80% confidence interval.
Nope. Suppose I roll a 100-sided die, and all LessWrongers write down their centred 80% credible interval for where the answer should fall. If the LWers are rational and calibrated, that interval should be [10,90]. So the actual outcome will fall in everybody’s credible interval or nobody’s. The relevant averaging should happen across questions, not across predictors.
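A quick simulation can make the dice argument concrete. This is only a sketch: the 18 predictors, the shared interval [10, 90], and the function name `simulate` are illustrative assumptions, not anything taken from the survey itself.

```python
import random

def simulate(n_predictors=18, n_questions=10_000, seed=0):
    """Every predictor uses the same interval [10, 90] on a 100-sided die."""
    rng = random.Random(seed)
    lo, hi = 10, 90  # shared interval from the dice example
    hits_per_question = []
    for _ in range(n_questions):
        roll = rng.randint(1, 100)  # one roll per question, seen by everyone
        hits = sum(lo <= roll <= hi for _ in range(n_predictors))
        hits_per_question.append(hits)
    # On any single question, either all predictors hit or none do.
    all_or_nothing = all(h in (0, n_predictors) for h in hits_per_question)
    # Averaged across questions, the hit rate is close to the nominal level.
    hit_rate = sum(h > 0 for h in hits_per_question) / n_questions
    return all_or_nothing, hit_rate

all_or_nothing, hit_rate = simulate()
print(all_or_nothing, round(hit_rate, 2))
```

Within a single roll the count of predictors who hit is always 0 or 18, so the "14 of 18" expectation never shows up; the roughly 80% figure only appears when averaging over many rolls.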
So I thought about it, and I think you’re correct. At least, if their error was correlated and nonrandom (which is very natural in many situations), then we wouldn’t expect 14 of 18 in the group to get it in their range. So you’re right. I can imagine hypotheticals where “group calibration in one-offs” might mean something, but not here now that I think about it.
Instead, I should’ve just stuck to pointing out how far outside their ranges many of them were, rather than the proportion of the group that got it in their range. For example, suppose an anonymous forecaster places a 0.0000001% probability on some natural macro-scale event having outcome A instead of B on the next observation. Outcome A happens. That is strong evidence that they weren’t just “incorrect with a lot of error”; they’re probably really uncalibrated too.
Eyeballing the survey chart, many of those 80% CIs are so narrow that they imply the true figure was a many-sigma event. That’s really incredible, and I would take it as evidence they’re also uncalibrated (not just wrong in this one-off). That’s a much better way to infer they’re uncalibrated than the 14 of 18 thing.
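To put a rough number on "many-sigma", one could convert a stated 80% interval into an implied standard deviation and ask how far away the truth landed. The sketch below assumes each forecaster's belief is roughly normal and centred on the midpoint of their interval, which is my assumption and not something the survey states; `implied_sigmas` is a hypothetical helper and the numbers in the usage lines are made up.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_90 = STD_NORMAL.inv_cdf(0.90)  # about 1.28: half-width of a central 80% interval, in sigmas

def implied_sigmas(lo, hi, truth):
    """How many implied standard deviations the truth lies from the interval midpoint.

    Assumes a normal belief whose central 80% interval is [lo, hi].
    """
    centre = (lo + hi) / 2
    sigma = (hi - lo) / 2 / Z_90
    return (truth - centre) / sigma

def two_sided_tail(z):
    """Probability the forecaster implicitly gave to a miss at least this large."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

# Hypothetical forecaster: 80% interval [90, 110], true value 160.
z = implied_sigmas(90, 110, 160)
print(round(z, 1), two_sided_tail(z))
```

A truth sitting exactly on the edge of the interval comes out at about 1.28 sigma with a two-sided tail of 0.2, as it should; a narrow interval with a distant truth gives a very large z and a tiny tail probability.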
Semi-related: I’m unclear about a detail in your dice example. You say a rational and calibrated interval “should be [10,90]” for the dice rolling. Why? A calibrated-over-time 80% confidence interval on such a die could be placed anywhere (e.g. [1,80] or [21,100]), so long as it is 80 units wide.
Yes, but the calibrated and centered interval is uniquely [10, 90].
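The uniqueness claim can be checked by brute force. This sketch treats the die as a continuous 0 to 100 scale (an approximation I am assuming so that the endpoints match the [10, 90] in the thread), enumerates every interval that is 80 units wide with integer endpoints, and keeps only those that leave equal probability on each side; `tails` is a hypothetical helper.

```python
def tails(lo, width=80, total=100):
    """Percent of outcomes below and above the interval [lo, lo + width]."""
    lower = lo
    upper = total - (lo + width)
    return lower, upper

# Every 80-wide interval is calibrated at the 80% level...
candidates = [(lo, lo + 80) for lo in range(0, 21)]

# ...but only one of them leaves 10% in each tail.
centred = [(lo, hi) for lo, hi in candidates if tails(lo)[0] == tails(lo)[1]]

print(len(candidates), centred)
```

All 21 candidates cover 80% of outcomes, which is the point about placing the interval anywhere, while the equal-tails requirement singles out one of them.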