This is a distinct problem from my (1) since every field has things that are nearly 100% certain to happen.
I think that’s because your #1 is actually two distinct problems :-). You mentioned two classes of prediction whose results would be uninformative about calibration elsewhere: easy near-certain predictions, and straightforward unambiguous probability questions. (Though … I bet the d10 results would be a little bit informative, in cases where e.g. the die has rolled a few 1s recently. But only by illuminating basic statistical cluefulness.) I agree that there are some of the former in every field, but in practice people interested in their calibration don’t tend to bother much with them. (They may make some 99% predictions, but those quite often turn out to be more like 80% predictions in reality.)
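For concreteness, here’s a rough sketch of why near-certain predictions tell you so little (the data and function name are made up): group predictions by stated confidence and compare against observed frequency. A 99% bucket that goes 10/10 is still consistent with a true hit rate well below 0.99 (0.8^10 ≈ 0.11, so even 80% reliability isn’t ruled out), whereas a 70% bucket that comes in at 50% tells you rather a lot.

```python
from collections import defaultdict

def calibration_table(predictions):
    """predictions: iterable of (stated_probability, came_true) pairs."""
    buckets = defaultdict(list)
    for p, outcome in predictions:
        buckets[p].append(1 if outcome else 0)  # group by stated confidence
    return {
        b: (sum(hits) / len(hits), len(hits))  # (observed frequency, sample size)
        for b, hits in sorted(buckets.items())
    }

# Hypothetical record: ten 99% predictions that all came true, and ten 70%
# predictions that split 5/5. The first bucket barely constrains calibration;
# the second one constrains it a great deal.
preds = [(0.99, True)] * 10 + [(0.7, True)] * 5 + [(0.7, False)] * 5
print(calibration_table(preds))  # {0.7: (0.5, 10), 0.99: (1.0, 10)}
```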
this approach gets the wrong sign [...] would count as two failed predictions when it is effectively just one.
It gets the right sign when the question is “how do we use these calibration numbers to predict future reliability?” (if someone demonstrates ability in guessing the next president, that confers some advantage in guessing the next Supreme Court justice) and the wrong sign when it’s “how do we use these prediction results to estimate calibration?”. I agree that it would be useful to have some way to identify when multiple predictions are effectively just one because they’re probing the exact same underlying event(s) rather than merely similar ones. In practice I suspect the best we can do is try to notice and adjust ad hoc, which of course brings problems of its own.
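To make the “notice and adjust ad hoc” part concrete, here’s a crude sketch (the cluster labels are hypothetical and assigned by hand): predictions judged to hinge on the same underlying event get collapsed into one effective result before scoring, so two failures driven by a single surprise count roughly once rather than twice.

```python
from collections import defaultdict

def effective_results(predictions):
    """predictions: iterable of (cluster_id, stated_probability, came_true),
    where cluster_id marks predictions probing the same underlying event."""
    clusters = defaultdict(list)
    for cluster_id, p, outcome in predictions:
        clusters[cluster_id].append((p, outcome))
    collapsed = []
    for members in clusters.values():
        # Average the stated probabilities and majority-vote the outcomes within
        # a cluster; crude, but that's the nature of an ad hoc adjustment.
        avg_p = sum(p for p, _ in members) / len(members)
        outcome = sum(o for _, o in members) >= len(members) / 2
        collapsed.append((avg_p, outcome))
    return collapsed

# "X wins the presidency" and "X's party picks the next Supreme Court justice"
# hinge on the same election result, so they share a cluster and count once.
preds = [
    ("election", 0.8, False),
    ("election", 0.75, False),
    ("economy", 0.6, True),
]
print(effective_results(preds))  # two effective results, not three
```

(Deciding what counts as “the same underlying event” is exactly the part that can’t be automated, which is why I’d only trust this as a manual adjustment.)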
(There was a study a little while back that looked at a bunch of pundits and assessed their accuracy, and lo! it turned out that the lefties were more reliable than the righties. That conclusion suited my own politics just fine but I’m pretty sure it was at least half bullshit because a lot of the predictions were about election results and the like, the time of the study was a time when the left, or what passes for the left in the US context, was doing rather well, and pundits tend to be overoptimistic about political people on their side. If we did a similar study right now I bet it would lean the other way. Perhaps averaging the two might actually tell us something useful.)