Just to confirm: Writing $p_t$, the probability of the event $A$ at time $t$, as $p_t = \mathbb{E}[\mathbf{1}_A \mid \mathcal{F}_t]$ (here $\mathcal{F}_t$ is the sigma-algebra at time $t$), we see that $p_t$ must be a martingale via the tower rule.
The log-odds $L_t = \log\frac{p_t}{1-p_t}$ are not martingales unless $\langle p\rangle_t \equiv 0$, because Itô gives us
$$dL_t = \frac{dp_t}{p_t(1-p_t)} + \frac{2p_t - 1}{2\,p_t^2(1-p_t)^2}\,d\langle p\rangle_t.$$
So unless $p_t$ is continuous and of bounded variation (⇒ $\langle p\rangle_t \equiv 0$, but for a continuous martingale this also implies that $p_t$ is constant; the integrand of the drift part only vanishes if $p_t = 1/2$ for all $t$), the log-odds are not a martingale.
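As a quick sanity check on the derivatives behind that drift term, here is a small symbolic sketch (my own addition, assuming sympy is available):

```python
# Symbolic check of the Itô ingredients for L = logit(p).
import sympy as sp

p = sp.symbols("p", positive=True)
L = sp.log(p / (1 - p))            # log-odds as a function of p

f1 = sp.diff(L, p)                 # multiplies the dp_t term
f2 = sp.diff(L, p, 2)              # f2/2 is the integrand of the drift term

# The expressions used in the formula above:
assert sp.simplify(f1 - 1 / (p * (1 - p))) == 0
assert sp.simplify(f2 - (2 * p - 1) / (p**2 * (1 - p) ** 2)) == 0

# The drift integrand vanishes only at p = 1/2:
print(sp.solve(sp.Eq(f2, 0), p))   # -> [1/2]
```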
Interesting analysis of the log-odds might still be possible (just use the corresponding discrete-time/jump-process machinery, which is what we naturally get when working with real data), but it’s not obvious to me whether this comes with any advantages over just working with $p_t$ directly.
I think this depends a lot on what you’re interested in, i.e. what scoring rules you use. Someone who runs the same analysis with Brier instead of log-scores might disagree.
More generally, I’m not convinced it makes sense to think of “precision” as a constant, let alone a universal one, since it depends on

- the scoring rule in question: Imagine a set of forecasts that’s awfully calibrated on values <1% and >99%, but perfectly calibrated on values between 1% and 99%. With the log-score, this will probably get a bad precision value, while with Brier this would give a great one. (See the sketch just after this list for the underlying asymmetry between the two scores.)
- someone’s calibration, as you point out with your final calibration plot.
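To make the first point concrete, here is a tiny sketch with made-up numbers (mine, not from the post): a forecast stuck at 0.1% when the true frequency is 5% is punished heavily by the log-score but barely at all by Brier.

```python
# Expected scores of a single forecast p when the event occurs with probability q.
# Illustrative numbers only.
import math

def expected_log_loss(p, q):
    """Expected log score (in nats) of forecasting p when P(event) = q."""
    return -(q * math.log(p) + (1 - q) * math.log(1 - p))

def expected_brier(p, q):
    """Expected Brier score of forecasting p when P(event) = q."""
    return q * (1 - p) ** 2 + (1 - q) * p ** 2

q = 0.05                    # true frequency
for p in (0.001, 0.05):     # badly miscalibrated extreme forecast vs. calibrated forecast
    print(f"p = {p:5.3f}:  log = {expected_log_loss(p, q):.3f}   Brier = {expected_brier(p, q):.4f}")
# Miscalibration at the extreme costs ~0.15 nats of log score but only ~0.002 in Brier.
```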
I don’t think it’s very counterintuitive/undesirable for (what, in practice, is essentially) noise to make worse-than-random forecasts better. As a matter of fact, this also happens if you replace log-scores with Brier in your analysis and use random noise instead of rounding.
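A minimal Monte Carlo sketch of that claim, under assumptions of my own choosing (a single anti-calibrated forecast of 80% for an event with true probability 20%, Gaussian noise added in log-odds space):

```python
# Does log-odds noise improve a worse-than-random forecast under Brier? (Sketch.)
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

q, p0, n = 0.2, 0.8, 1_000_000     # true probability, anti-calibrated forecast, samples
outcomes = (rng.random(n) < q).astype(float)

def mean_brier(noise_sd):
    """Mean Brier score after adding N(0, noise_sd^2) noise to the forecast's log-odds."""
    noisy_p = sigmoid(logit(p0) + noise_sd * rng.standard_normal(n))
    return np.mean((noisy_p - outcomes) ** 2)

for sd in (0.0, 1.0, 3.0, 10.0):
    print(f"noise sd = {sd:4.1f}:  Brier = {mean_brier(sd):.3f}")
# The noiseless forecast scores ~0.52; every noisy version here scores lower (better).
```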
Also, regarding oscillations: I don’t think properties of “precision” obtained from small datasets are too important, for much the same reason that I usually don’t pay a lot of attention to calibration plots obtained from a handful of forecasts.
This conjecture is true and should easily generalise to more general 1-parameter families of centered, symmetric distributions admitting suitable couplings (e.g. additive $\mathcal{N}(0,\sigma^2)$ noise in log-odds space), using the fact that $\log \operatorname{sigmoid}(x+y) + \log \operatorname{sigmoid}(x-y)$ is decreasing in $y$ for all log-odds $x$ and all positive $y$ (QED).
(NB: This fails when replacing log-scores with Brier.)
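For completeness, the monotonicity fact reduces to a one-line derivative computation, using $(\log \operatorname{sigmoid})' = 1 - \operatorname{sigmoid}$:
$$\frac{d}{dy}\Big[\log \operatorname{sigmoid}(x+y) + \log \operatorname{sigmoid}(x-y)\Big] = \operatorname{sigmoid}(x-y) - \operatorname{sigmoid}(x+y) < 0 \quad\text{for } y>0,$$
since the sigmoid is strictly increasing.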
I could make a similar argument for the noise-based version, if I chose to use Brier (or any other scoring rule S that depends only on |p-outcome| and converges to finite values as p tends towards 0 and 1): With sufficiently strong noise, every forecast becomes ≈0% or ≈100% with equal probability, so the expected score in the “large noise limit” converges to (S(0, outcome) + S(1, outcome))/2.
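To spell out the Brier case: S(0, outcome) + S(1, outcome) = outcome² + (1 − outcome)² = 1 for a binary outcome, so that limit is 1/2 regardless of how the event resolves.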