I’ve just looked at scoring functions for predictions. There’s the Brier score, which measures the squared distance from the probabilities $(p_1,\dots,p_n)$ with $p_i\in[0,1]$ to the outcomes $(x_1,\dots,x_n)$ with $x_i\in\{0,1\}$, negated here so that higher is better, i.e.,
$$-\sum_{i=1}^{n}(p_i-x_i)^2$$
(Maybe scaled by $\frac{1}{n}$.) Then there’s logarithmic scoring, which sums up the logarithms of the probabilities assigned to the outcomes that came to pass, i.e.,
$$\sum_{i=1}^{n}\bigl[x_i\ln(p_i)+(1-x_i)\ln(1-p_i)\bigr]$$
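For concreteness, here is a minimal Python sketch of both rules as written above (the function names `brier_score` and `log_score` are just illustrative, not from any particular library):

```python
import math

def brier_score(probs, outcomes):
    """Negated sum of squared distances between probabilities and outcomes
    (higher is better, maximum 0), matching the first formula above."""
    return -sum((p - x) ** 2 for p, x in zip(probs, outcomes))

def log_score(probs, outcomes):
    """Sum of log-probabilities assigned to the outcomes that occurred
    (higher is better, maximum 0), matching the second formula above."""
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p)
               for p, x in zip(probs, outcomes))

# Three predictions, two of which came true.
probs = [0.9, 0.7, 0.2]
outcomes = [1, 1, 0]
print(brier_score(probs, outcomes))  # ≈ -0.14
print(log_score(probs, outcomes))    # ≈ -0.685
```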
Both of these have the property that, if $k$ out of $n$ predictions come true, then the probability that maximizes your score (provided you have to choose the same probability for all predictions) is $\frac{k}{n}$. That’s good. However, logarithmic scoring also has the property that, for a prediction that came true, as your probability for that prediction approaches 0, your score approaches $-\infty$. This feels like a property that any system should have; for an event that came true, predicting (1:100) odds is much less bad than (1:1000) odds, which is much less bad than (1:10000) odds, and so forth. The penalty shouldn’t be bounded.
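To spell out the $\frac{k}{n}$ claim above for the log rule (a quick sketch, assuming a single probability $p$ is used for all $n$ predictions, of which $k$ came true): the score is $k\ln(p)+(n-k)\ln(1-p)$, and setting its derivative to zero gives
$$\frac{k}{p}-\frac{n-k}{1-p}=0\quad\Longrightarrow\quad p=\frac{k}{n}.$$
The Brier formula gives the same maximizer: differentiating $-k(p-1)^2-(n-k)p^2$ yields $2k-2np=0$, i.e. $p=\frac{k}{n}$.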
Brier score has bounded penalties. For $x_i=1$, the predictions $p_i=\frac{1}{1000}$ and $p_i=\frac{1}{10^{1000}}$ receive almost identical scores. This seems deeply philosophically wrong. Why is anyone using Brier scoring? Do people disagree with the intuition that penalties should be unbounded?
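To put numbers on the bounded-vs-unbounded contrast (a small illustration; I use $10^{-3}$, $10^{-6}$, $10^{-9}$ rather than $\frac{1}{10^{1000}}$, which would underflow a Python float):

```python
import math

# Penalty contributed by a single prediction that came true (x_i = 1),
# under each rule, as the assigned probability shrinks.
for p in (1e-3, 1e-6, 1e-9):
    brier_penalty = -(p - 1) ** 2      # bounded below by -1
    log_penalty = math.log(p)          # unbounded below
    print(f"p={p:>6}: Brier {brier_penalty:.6f}, log {log_penalty:.1f}")

# p= 0.001: Brier -0.998001, log -6.9
# p= 1e-06: Brier -0.999998, log -13.8
# p= 1e-09: Brier -1.000000, log -20.7
```

The Brier penalty saturates near $-1$ almost immediately, while the log penalty keeps growing as confidence in the wrong direction increases.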
Yeah, I also don’t like Brier scores. My guess is they are better at allowing people to pretend that Brier scores on different sets of forecasts are meaningfully comparable (producing meaningless sentences like “superforecasters generally have a Brier score around 0.35”), whereas the log scoring rule only ever loses you points, so it’s clearer that it really isn’t comparable between different question sets.
In practice, you can’t (monetarily) reward forecasters with unbounded scoring rules. You may also want scoring rules to be somewhat forgiving.