Is the metric calibration?
Yes*.
*: Terms and conditions apply. Marginal opinions only.
Couldn’t you max out on calibration by guessing 50% for everything?
Yes. (To nitpick, with existing platforms one would max out calibration in expectation by guessing 17.5% or 39% or 29%.)
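A minimal sketch of the nitpick, assuming those figures are the base rates at which questions resolve positively on different platforms, and that calibration is scored as the gap between stated probabilities and observed frequencies within probability buckets (the exact scoring details will differ by platform):

```python
import numpy as np

def calibration_error(predictions, outcomes, n_bins=10):
    """Mean absolute gap between stated probability and observed
    frequency, averaged over the occupied probability buckets."""
    predictions, outcomes = np.asarray(predictions), np.asarray(outcomes)
    bins = np.clip((predictions * n_bins).astype(int), 0, n_bins - 1)
    gaps = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gaps.append(abs(predictions[mask].mean() - outcomes[mask].mean()))
    return float(np.mean(gaps))

rng = np.random.default_rng(0)
base_rate = 0.29  # hypothetical platform base rate of questions resolving yes
outcomes = rng.random(10_000) < base_rate

# Guessing the base rate on every question is (near-)perfectly calibrated...
print(calibration_error(np.full(10_000, base_rate), outcomes))  # ~0.0
# ...while guessing 50% on every question is not.
print(calibration_error(np.full(10_000, 0.5), outcomes))        # ~0.21
```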
Ideally we’d want to use a proper scoring rule, but this brings up other Goodharting issues: if people can select the questions they predict on, this incentivizes predicting on easier questions, and people who have made very few forecasts will often appear at the top of the ranking. So we’d like to use something like a credibility formula; I plan on writing something up on this.
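For concreteness, here is a sketch of what that could look like: the Brier score as the proper scoring rule, plus a simple credibility-style shrinkage toward the platform average for people with few forecasts. The platform-average score and the constant k below are made-up numbers for illustration, not anything an existing platform uses:

```python
import numpy as np

rng = np.random.default_rng(0)

def brier_score(predictions, outcomes):
    """Proper scoring rule for binary questions: mean squared error
    between forecast probabilities and 0/1 outcomes (lower is better)."""
    p, o = np.asarray(predictions, float), np.asarray(outcomes, float)
    return float(np.mean((p - o) ** 2))

def credibility_adjusted(score, n_forecasts, population_score, k=50):
    """Shrink a forecaster's score toward the population average when they
    have made few forecasts. The weight n/(n+k) is the standard actuarial
    credibility form; k is an assumed tuning constant."""
    z = n_forecasts / (n_forecasts + k)
    return z * score + (1 - z) * population_score

population_score = 0.20  # assumed platform-wide average Brier score

# A forecaster who got lucky on 3 questions vs. one with a solid
# track record over 300 questions.
lucky  = brier_score([0.9, 0.1, 0.8], [1, 0, 1])
steady = brier_score(rng.uniform(0.6, 0.9, 300), np.ones(300))

# Raw scores favour the lucky forecaster (~0.02 vs ~0.07);
# credibility-adjusted scores favour the steady one (~0.19 vs ~0.09).
print(lucky,  credibility_adjusted(lucky,  3,   population_score))
print(steady, credibility_adjusted(steady, 300, population_score))
```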
True. On the other hand, if I publicly say that I consider myself an expert on X and ignorant on Y, should my self-assessment on X be penalized just because I got the answers on Y wrong?
Depends on the correlation in accuracy within X vs between X and Y.
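A toy simulation of that point, with made-up variance parameters: each forecaster gets a general skill component shared across domains plus a domain-specific one, and the share of variance the shared component explains determines how much bad answers on Y should update an assessment of X:

```python
import numpy as np

rng = np.random.default_rng(1)
n_forecasters = 5_000

def cross_domain_correlation(var_general, var_domain):
    """Accuracy in each domain = shared general skill + domain-specific skill.
    Returns the correlation between a forecaster's X and Y accuracy."""
    general = rng.normal(0, np.sqrt(var_general), n_forecasters)
    acc_x = general + rng.normal(0, np.sqrt(var_domain), n_forecasters)
    acc_y = general + rng.normal(0, np.sqrt(var_domain), n_forecasters)
    return float(np.corrcoef(acc_x, acc_y)[0, 1])

# Mostly general skill: wrong answers on Y say a lot about X.
print(cross_domain_correlation(var_general=0.9, var_domain=0.1))  # ~0.9
# Mostly domain-specific skill: wrong answers on Y say little about X.
print(cross_domain_correlation(var_general=0.1, var_domain=0.9))  # ~0.1
```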
What pool of questions would people make predictions on?