For example, Scott ending up ~60% right on the things that he thinks are 50% likely suggests that he’s throwing away some of his signal.
If we compare two hypotheses:
Perfect calibration at 50%
vs
Unknown actual calibration (uniform prior across [0,1])
Then the Bayes factor is about 2:1 in favour of the former hypothesis (for 7⁄11 correct), so it seems that Scott isn’t throwing away information. Looking across other years supports this: his total of 30 out of 65 is about 5:1 evidence in favour of the former hypothesis.
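For anyone who wants to check those ratios, here is a minimal sketch of the calculation, assuming the two hypotheses are exactly as stated (a point hypothesis at p = 0.5 versus a uniform prior on p). The function name is made up for illustration. It relies on the standard result that a binomial likelihood integrated over a uniform prior gives 1/(n+1) for every possible count.

```python
from math import comb


def bayes_factor_point_vs_uniform(k, n, p0=0.5):
    """Bayes factor for H0: p = p0 against H1: p ~ Uniform[0, 1].

    k is the number of correct predictions out of n.
    Values above 1 favour H0 (perfect calibration at p0).
    """
    # Marginal likelihood under H0: the binomial probability at p0.
    like_h0 = comb(n, k) * p0**k * (1 - p0) ** (n - k)
    # Marginal likelihood under H1: the binomial integrated over a
    # uniform prior on p, which equals 1 / (n + 1) for every k.
    like_h1 = 1 / (n + 1)
    return like_h0 / like_h1


print(round(bayes_factor_point_vs_uniform(7, 11), 2))   # 1.93, i.e. about 2:1
print(round(bayes_factor_point_vs_uniform(30, 65), 2))  # a bit over 5, i.e. about 5:1
```

A different prior under the second hypothesis would change these numbers, so the 2:1 and 5:1 figures are specific to the uniform choice.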
So I was mostly averaging percentages across the years instead of counting, which isn’t great; knowing that it’s 30⁄65 makes me much more on board with “oh yeah there’s no signal there.”
But your comparison between hypotheses seems wrong to me; presumably it should be closer to a BIC-style test, where you decide whether it’s worth storing the extra parameter p?
The Bayes factor calculation that I did is the analytical result for which BIC is an approximation (see this sequence). Generally BIC is a large-N approximation, but in this case the two actually do end up being fairly similar even with low N.
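A rough way to see how close the two are, as a sketch rather than anything definitive: compute the exact Bayes factor and the BIC-based approximation side by side. This assumes the usual definition BIC = d·ln(n) − 2·ln(max likelihood), with d = 0 free parameters for the fixed 50% hypothesis and d = 1 for the unknown-p hypothesis, and the usual approximation BF ≈ exp((BIC₁ − BIC₀)/2). The function names are illustrative.

```python
from math import comb, exp, log


def exact_bf(k, n):
    """Exact Bayes factor: p = 0.5 versus p ~ Uniform[0, 1]."""
    return comb(n, k) * 0.5**n * (n + 1)


def bic_bf(k, n):
    """BIC approximation to the same Bayes factor (needs 0 < k < n)."""
    p_hat = k / n  # maximum-likelihood estimate of p under H1
    # Log-likelihoods; the binomial coefficient is omitted because it
    # is identical under both hypotheses and cancels in the difference.
    loglik_h0 = n * log(0.5)
    loglik_h1 = k * log(p_hat) + (n - k) * log(1 - p_hat)
    bic_h0 = 0 * log(n) - 2 * loglik_h0  # no free parameters
    bic_h1 = 1 * log(n) - 2 * loglik_h1  # one free parameter, p
    return exp((bic_h1 - bic_h0) / 2)


for k, n in [(7, 11), (30, 65)]:
    print(f"{k}/{n}: exact {exact_bf(k, n):.2f}, BIC approx {bic_bf(k, n):.2f}")
```

On these counts the approximation comes out somewhat higher than the exact value in both cases, but it points the same way and is of similar size.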