Note that it is possible to deceive others by systematically adjusting predictions upward or downward, depending on how desirable it is that other people believe them, in a way that preserves your score.
This is true even if you bucket the predictions. Say you're evaluating somebody's predictive track record, and you see that when they assign a 60% probability to an event, that event occurs 60% of the time. This doesn't mean that any -specific- prediction they make at 60% will occur 60% of the time, however! They can balance out their predictions by adjusting two of them, overestimating the odds of one and underestimating the odds of the other, to give the appearance of perfectly calibrated predictions.
The Brier score is useful for evaluating how good a forecaster is, but that is not the same as evaluating how good any individual forecast is. If the Oracle really hates Odysseus, the Oracle could give a forecast that, if believed, results in a worse outcome for Odysseus, and balance it out with a forecast given to someone else so that the Oracle still appears perfectly calibrated.
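To make that concrete (the probabilities here are made-up toy numbers, not anything from a real forecaster): suppose the true chances of two events are 80% and 40%, and the forecaster reports 60% for both. On average, 60% of the events in that fake 60% bucket occur, so bucketed calibration looks perfect, yet the expected Brier score is worse than for honest reports. A quick sketch:

```python
# Toy illustration (hypothetical numbers): faking calibration by balancing two forecasts.
# Assumed true chances are 0.8 and 0.4; the deceptive forecaster reports 0.6 for both.

def brier(forecast, outcome):
    """Brier score for a single binary forecast: squared error against the 0/1 outcome."""
    return (forecast - outcome) ** 2

true_probs = [0.8, 0.4]   # assumed true chances of the two events
honest     = [0.8, 0.4]   # honest reports
deceptive  = [0.6, 0.6]   # both forecasts shifted into the 60% bucket

# Expected frequency of events in the deceptive 60% bucket:
# (0.8 + 0.4) / 2 = 0.6, so the bucket looks perfectly calibrated on average.
bucket_rate = sum(true_probs) / len(true_probs)
print(f"expected hit rate in the 60% bucket: {bucket_rate:.0%}")

def expected_brier(reports):
    """Average expected Brier score over the two events (lower is better)."""
    return sum(p * brier(r, 1) + (1 - p) * brier(r, 0)
               for p, r in zip(true_probs, reports)) / len(reports)

print(f"expected Brier, honest:    {expected_brier(honest):.3f}")     # 0.200
print(f"expected Brier, deceptive: {expected_brier(deceptive):.3f}")  # 0.240
```

The bucketed calibration check can't tell the two apart, but the honest reports have a better expected Brier score (0.20 vs. 0.24).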
Agreed. I think a strong reason why relying on calibration works at all is that forecasters are primarily judged by some other, strictly proper scoring rule, meaning they wouldn't have an incentive to fake calibration if it made them come out worse in terms of e.g. Brier or log score.
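As a quick sanity check on that (toy numbers again, just a sketch of strict propriety): for a single binary event with true probability p, the expected Brier score p*(r-1)^2 + (1-p)*r^2 is minimized exactly at a report of r = p, which is why shading a forecast has to cost something under that rule.

```python
# Sketch of strict propriety of the Brier score (true probability assumed to be 0.8).
# The expected score p*(r-1)^2 + (1-p)*r^2 is minimized exactly at r = p.

p = 0.8  # assumed true probability of the event

def expected_brier(r, p):
    """Expected Brier score when the event occurs with probability p and we report r."""
    return p * (r - 1) ** 2 + (1 - p) * r ** 2

reports = [i / 100 for i in range(101)]
best = min(reports, key=lambda r: expected_brier(r, p))
print(f"report minimizing expected Brier score: {best:.2f}")           # 0.80
print(f"expected score at the truth (0.8):  {expected_brier(0.8, p):.3f}")
print(f"expected score when shading to 0.6: {expected_brier(0.6, p):.3f}")
```

The log score behaves the same way, so any systematic shading of forecasts shows up as a worse expected Brier or log score, even if calibration still looks fine.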