“Always phrase predictions such that the confidence is above the baseline probability”: this really seems like it should not matter. I don’t have a cohesive argument against it at this stage, but a prediction and its reversal should fundamentally be the same prediction.
So I’ve thought about this a bit more. It doesn’t matter how someone states their probabilities. However, in order to use your evaluation technique, we just need to transform the probabilities so that all of them are above the baseline.
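A minimal sketch of that transformation in Python (the function name and interface are my own invention, not something from the thread):

```python
def restate_above_baseline(p, baseline):
    """Restate a prediction so its confidence is at or above its baseline.

    'X with probability p' is logically equivalent to 'not-X with
    probability 1 - p', and the baseline flips along with the claim:
    if the baseline for X is b, the baseline for not-X is 1 - b.
    Whenever p < b, we have 1 - p > 1 - b, so exactly one side of
    every prediction can be evaluated as an above-baseline claim.
    Returns (probability, baseline, flipped).
    """
    if p >= baseline:
        return p, baseline, False      # keep the claim about X
    return 1 - p, 1 - baseline, True   # evaluate the claim about not-X


# A 40% prediction on X against a 70% baseline becomes a 60%
# prediction on not-X against a 30% baseline.
print(restate_above_baseline(0.40, 0.70))  # ≈ (0.6, 0.3, True)
```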
Yes, I think that’s exactly right. Statements are symmetric: 50% that X happens ⟺ 50% that ¬X happens. But evaluation is not symmetric. So you can consider each prediction as making two logically equivalent claims (X happens with probability p and ¬X happens with probability 1−p) plus stating which of the two you want to be evaluated on. This matters because the two claims will miss the “correct” probability in different directions. If 50% confidence is too high for X (Tesla stock price is in a narrow range), then 50% is too low for ¬X (Tesla stock price is outside that narrow range).
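A toy numeric check of that directional asymmetry (the 30% “correct” probability is a number I made up for illustration):

```python
true_p = 0.30   # hypothetical "correct" probability that the stock
                # lands in the narrow range (made up for illustration)
claimed = 0.50  # the stated confidence

miss_on_x = claimed - true_p                   # +0.2: 50% is too high for X
miss_on_not_x = (1 - claimed) - (1 - true_p)   # -0.2: 50% is too low for not-X
print(round(miss_on_x, 2), round(miss_on_not_x, 2))  # 0.2 -0.2
```

Same claim, opposite directions of error, which is why the choice of which side to be evaluated on carries information.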
(Plus, in any case, it’s not clear that we can always agree on a baseline probability.)
I think that’s the reason why calibration is inherently impressive to some extent. If it were actually boldness multiplied by calibration, then you should not be impressed at all whenever the boldness pile and confidence pile have identical heights. And I think that’s correct in theory; if I just make predictions about dice all day, you shouldn’t be impressed at all regardless of the outcome. But since, for all practical purposes, it takes some skill to estimate the baseline, boldness doesn’t go to zero.
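One way to make that concrete, under my own assumed definition of boldness as the gap between the stated (above-baseline) confidence and the baseline:

```python
def boldness(p, baseline):
    """Boldness as the gap between stated confidence and the baseline.
    One possible formalization, assuming p has already been restated
    to sit at or above the baseline (see the earlier sketch)."""
    return p - baseline


# Dice all day: the stated confidence equals the true baseline, so
# boldness is zero and no outcome should impress anyone.
print(boldness(1 / 6, 1 / 6))          # 0.0

# Practical questions: the baseline itself has to be estimated, so a
# calibrated 60% claim against a fuzzy ~50% baseline keeps nonzero
# boldness.
print(round(boldness(0.60, 0.50), 2))  # 0.1
```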