Scoring ambiguous predictions: Suppose you want predictions not just to resolve to ‘true’ or ‘false’, but sometimes to any value in the range [0,1]; maybe it makes sense to assign 0.4 to “it will rain in SF” if only 40% of SF gets rained on, for example. (Or a prediction is true on some interpretations but not others, so it’s more accurate to resolve it as ~80% true rather than entirely true or false.)
Do the scoring rules handle this sensibly? First, let’s assume the procedure that generates percentages from ambiguous statements is exact: if a question resolves to 80%, then 80% was also the best possible prediction ahead of time.
It looks like log scoring does: your realized score becomes the formula for your expected score, with the ambiguous resolution a standing in for the true probability. Spherical scoring likewise lets you use the expected-score formula as the realized score. Brier scoring uses a different mechanism: you just make the observation continuous instead of binary, but again the rule remains proper.
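A quick numerical sanity check of that claim (a sketch: the generalized rules below are just the standard binary scores with the resolution value a in the slot where the 0/1 outcome used to go):

```python
import numpy as np

def log_score(p, a):
    # Generalized log score: the realized score is the usual
    # expected-score formula with resolution a weighting the outcomes.
    return a * np.log(p) + (1 - a) * np.log(1 - p)

def brier_score(p, a):
    # Brier with a continuous observation: negative squared error.
    return -(p - a) ** 2

def spherical_score(p, a):
    # Spherical score with resolution a weighting the two outcomes.
    norm = np.sqrt(p ** 2 + (1 - p) ** 2)
    return (a * p + (1 - a) * (1 - p)) / norm

# If the resolution mechanism is exact (a = 0.8 was also the best
# possible forecast), each rule should be maximized by reporting p = a.
a = 0.8
grid = np.linspace(0.01, 0.99, 9801)  # step 0.0001, includes 0.8
for score in (log_score, brier_score, spherical_score):
    best_p = grid[np.argmax(score(grid, a))]
    print(score.__name__, round(best_p, 4))  # each prints 0.8
```

All three grid searches land on p = 0.8, i.e. each generalized rule is still maximized by reporting the resolution value itself.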
[I think it’s neat that they keep their status as proper, because the property that made them proper (the expected score is maximized by reporting your true probability) is exactly what this change holds constant.]
Of course, in the real world you might expect the percentages to be resolved by a mechanism that differs from the underlying reality. Now we have to estimate both reality and the judges, and the optimal report will be a weighted mixture of the two.
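As a toy illustration of that mixture (hypothetical numbers, using the Brier rule from above): suppose you think the event is 70% likely, but the judges tend to resolve ‘true’ cases at 0.9 and ‘false’ cases at 0.2 rather than at exactly 1 and 0. The score-maximizing report is then your expectation of the *resolution*, which blends your belief about reality with your model of the judges:

```python
import numpy as np

# Hypothetical numbers: P(event) = 0.7, and the judges resolve
# true cases at 0.9 and false cases at 0.2 instead of 1 and 0.
p_true = 0.7
res_if_true, res_if_false = 0.9, 0.2

def expected_brier(p):
    # Expected negative squared error over the judges' resolutions.
    return -(p_true * (p - res_if_true) ** 2
             + (1 - p_true) * (p - res_if_false) ** 2)

grid = np.linspace(0, 1, 100001)  # step 1e-5
best_p = grid[np.argmax(expected_brier(grid))]
print(best_p)  # 0.69 = 0.7*0.9 + 0.3*0.2, not your 0.7 belief
```

The optimum is E[resolution] = 0.69, not 0.7: the better you understand the judges’ mechanism, the more your report shifts away from your raw estimate of reality.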