> methods based on rounding probabilities are hot flaming garbage
I think this depends a lot on what you’re interested in, i.e. what scoring rules you use. Someone who runs the same analysis with Brier instead of log-scores might disagree.
More generally, I’m not convinced it makes sense to think of “precision” as a constant, let alone a universal one, since it depends on

- the scoring rule in question: imagine a set of forecasts that’s awfully calibrated on values <1% and >99%, but perfectly calibrated on values between 1% and 99%. With the log-score this will probably get a bad precision value, while with Brier it would get a great one (see the sketch after this list);
- someone’s calibration, as you point out with your final calibration plot.
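To make the scoring-rule point concrete, here’s a minimal Python sketch (my own illustration, not anything from the post) of how differently the two rules treat a badly miscalibrated extreme forecast versus a mid-range miss:

```python
import math

def log_score(p, outcome):
    # Natural-log score of forecast p for a binary outcome (higher is better).
    return math.log(p if outcome else 1 - p)

def brier(p, outcome):
    # Brier score of forecast p (lower is better).
    return (p - outcome) ** 2

# A miscalibrated extreme forecast (99.9% on an event that doesn't happen)
# versus a mid-range miss (60% on an event that doesn't happen).
for p in (0.999, 0.60):
    print(f"p={p:.3f}  log-score={log_score(p, 0):7.3f}  Brier={brier(p, 0):.3f}")

# Roughly:
#   p=0.999  log-score= -6.908  Brier=0.998
#   p=0.600  log-score= -0.916  Brier=0.360
# The log-score penalty is unbounded as p -> 1 on a non-event, while the Brier
# penalty is capped at 1, so miscalibration at the extremes dominates a
# log-score-based "precision" estimate far more than a Brier-based one.
```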
> I believe that these approaches are not good: For small datasets they produce large oscillations in the score, not smooth declines, and they improve the scores of worse-than-random forecast datasets.
I don’t think it’s very counterintuitive/undesirable for (what, in practice, is essentially) noise to make worse-than-random forecasts better. As a matter of fact, this also happens if you replace log-scores with Brier in your analysis with random noise instead of rounding.
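As a quick illustration (my own toy setup, with a symmetric two-point perturbation in log-odds space standing in for the noise, so the numbers are only indicative):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Worse-than-random setup: events occur 80% of the time, but the forecaster
# always says 20% (i.e. anti-calibrated).
base_rate, forecast = 0.8, 0.2
x = math.log(forecast / (1 - forecast))  # forecast in log-odds

def expected_brier(y):
    # Expected Brier score when the log-odds forecast is shifted by +y or -y
    # with equal probability (a symmetric two-point perturbation).
    total = 0.0
    for p in (sigmoid(x + y), sigmoid(x - y)):
        total += 0.5 * (base_rate * (p - 1) ** 2 + (1 - base_rate) * p ** 2)
    return total

for y in (0.0, 1.0, 2.0, 4.0):
    print(f"perturbation y={y}: expected Brier = {expected_brier(y):.3f}")

# Roughly: y=0 -> 0.520, y=1 -> 0.494, y=2 -> 0.466, y=4 -> 0.485.
# Perturbing the anti-calibrated forecasts *improves* their Brier score
# (and, unlike with the log-score, not monotonically in the perturbation size).
```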
Also, regarding oscillations: I don’t think properties of “precision” obtained from small datasets are too important, for much the same reasons that I usually don’t pay a lot of attention to calibration plots obtained from a handful of forecasts.
> As we increase the perturbation, the score falls ~monotonically (which I conjecture to always be true in the limit of infinitely many samples)
This conjecture is true, and should generalise readily to other 1-parameter families of centered, symmetric noise distributions admitting suitable couplings (e.g. additive N(0, σ²) noise in log-odds space), using the fact that log(sigmoid(x+y)) + log(sigmoid(x−y)) is decreasing in y for every log-odds value x and all y > 0 (a one-line derivative check, spelled out below). (NB: This fails when replacing log-scores with Brier.)
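In case it’s useful, the decreasingness fact is just that derivative computation (σ is the logistic sigmoid, so (log σ)′ = 1 − σ):

$$
\frac{d}{dy}\Bigl[\log\sigma(x+y)+\log\sigma(x-y)\Bigr]
= \bigl(1-\sigma(x+y)\bigr)-\bigl(1-\sigma(x-y)\bigr)
= \sigma(x-y)-\sigma(x+y) < 0 \quad \text{for } y>0,
$$

since σ is strictly increasing; averaging over the ±y coupling then makes the expected log-score decrease as the perturbation grows.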
> Rounding very strongly rounds everything to 50%, so with strong enough rounding every dataset has the same score.
I could make a similar argument for the noise-based version if I chose to use Brier (or any other scoring rule S that depends only on |p − outcome| and converges to finite values as p tends towards 0 and 1): with sufficiently strong noise, every forecast becomes ≈0% or ≈100% with equal probability, so the expected score in the “large noise limit” converges to (S(0, outcome) + S(1, outcome))/2.
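A quick Monte Carlo sketch of this for Brier, with strong Gaussian noise in log-odds space standing in for “sufficiently strong noise” (again my own toy setup, not the post’s exact noise model):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def mean_brier_with_noise(forecasts, outcomes, sigma):
    # Perturb forecasts with N(0, sigma^2) noise in log-odds space, then score.
    logodds = np.log(forecasts / (1 - forecasts))
    noisy = 1 / (1 + np.exp(-(logodds + rng.normal(0.0, sigma, len(forecasts)))))
    return np.mean((noisy - outcomes) ** 2)

# Two very different datasets on events that occur 80% of the time:
# well-calibrated constant 80% forecasts vs anti-calibrated constant 20% forecasts.
outcomes = (rng.random(n) < 0.8).astype(float)
good = np.full(n, 0.8)
bad = np.full(n, 0.2)

for sigma in (0.0, 5.0, 50.0):
    print(f"sigma={sigma:>5}: good -> {mean_brier_with_noise(good, outcomes, sigma):.3f}, "
          f"bad -> {mean_brier_with_noise(bad, outcomes, sigma):.3f}")

# With sigma=0 the two datasets score very differently (~0.16 vs ~0.52); as sigma
# grows, both mean Brier scores approach (S(0, outcome) + S(1, outcome)) / 2 = 0.5,
# so in the large-noise limit the score stops distinguishing the datasets.
```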