If one wants to measure my correctness across multiple confidence levels, then it is unclear what aggregation procedure to use.
Yes, that is precisely the issue for me here. Essentially, you have to specify a loss function and then aggregate it across predictions. It’s unclear which loss function will work best here and what “best” even means.
You may find the Wikipedia page on scoring rules interesting.
Yes, thank you, that’s useful.
Notably, Philip Tetlock in his Expert Political Judgment project uses Brier scoring.
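For concreteness, here is a minimal sketch of what Brier scoring looks like as a "loss function plus aggregation" choice: the loss is the squared error between the stated probability and the 0/1 outcome, and the aggregation is a plain mean over all predictions, regardless of confidence level. This is the common binary form; the exact variant used in Tetlock's project may differ. The function name `brier_score` and the sample numbers are made up for illustration.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes.

    forecasts: probabilities in [0, 1] assigned to "the event happens".
    outcomes:  1 if the event happened, 0 otherwise.
    Lower is better; 0.0 is perfect, and always answering 0.5 scores 0.25.
    """
    if len(forecasts) != len(outcomes) or len(forecasts) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical predictions made at three different confidence levels.
example_forecasts = [0.9, 0.6, 0.2]
example_outcomes = [1, 1, 0]
example_score = brier_score(example_forecasts, example_outcomes)
```

Because the squared error is a proper scoring rule, reporting your true probability minimizes the expected loss, which is one possible answer to what "best" could mean here.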