Because 8.86×10⁻³⁰ > 2.76×10⁻³⁰, person A is also a slightly better predictor than person B.
Wait, I got confused by the function you used to assign the calibration score. It worked in that case, but it will yield higher values for those who make more ‘correct’ predictions, not for those who are better calibrated. For example, person A predicts 100 things with 60% confidence and 61 of them turn out to occur, while person D predicts 100 things with 60% confidence and 60 of them turn out to occur. Person D is better calibrated, but gets a lower score than person A: ~5.9e-30 vs ~8.86e-30 (and person E, who made 100 predictions with 60% confidence that all turned out to be true, would score ~6.53e-23).
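To make that concrete, here is a minimal sketch. The function name and the exact form of the score (confidence^hits × (1−confidence)^misses) are my assumption about the rule being discussed, chosen only because it reproduces the ~5.9e-30 and ~8.86e-30 figures above:

```python
def likelihood_score(confidence: float, hits: int, total: int) -> float:
    """Probability the predictor assigned to the observed outcomes
    (assumed rule: confidence**hits * (1 - confidence)**misses)."""
    misses = total - hits
    return confidence ** hits * (1 - confidence) ** misses

# Hypothetical predictors from the example above, all at 60% confidence:
for name, hits in [("D (60/100, perfectly calibrated)", 60),
                   ("A (61/100)", 61),
                   ("E (100/100)", 100)]:
    print(f"person {name}: {likelihood_score(0.6, hits, 100):.3g}")

# Output:
# person D (60/100, perfectly calibrated): 5.91e-30
# person A (61/100): 8.86e-30
# person E (100/100): 6.53e-23
# The score grows with the number of hits, so it rewards being right,
# not being well calibrated.
```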
I’m looking forward to reading it, because I think one of the current bottlenecks limiting how many predictions I make is that I cannot easily compare how I’m doing week after week, and I have been looking for a model that helps me track how I’m doing across several predictions.