Yes, and in particular, by Scott saying that 50% predictions are “technically meaningless.”
I believe what “technically meaningless” means in that sentence is something like “the simple rule doesn’t distinguish between predictions.” The “Biden 60%” prediction has one canonical form, but a “Biden 50%” prediction is as canonical as a “not Biden 50%” prediction. So you have to use some other rule to distinguish them, and that means the 50% column is meaningfully different from the other columns on the graph.
For example, Scott ending up ~60% right on the things that he thinks are 50% likely suggests that he’s throwing away some of his signal, in that his ‘arbitrary’ rule for deciding whether to write “Ginsburg still alive” instead of “Ginsburg not still alive” (and similar calls) is in fact weakly favoring the one that ends up happening. (This looks like a weak effect in previous years as well, though it sometimes reverses.)
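To make the “two equally canonical wordings” point concrete, here is a toy sketch. This is my own illustration, not code from the post or the thread, and “Biden wins” is just a placeholder event: a belief about an event E can be written as “E: p” or as “not E: 1 − p”, and the usual convention of stating whichever side gets at least 50% picks a unique wording everywhere except at exactly 50%, where the wording is arbitrary yet decides whether the row is tallied as correct.

```python
# Toy illustration (mine, not from the post) of why the 50% column needs
# an extra, arbitrary rule.

def wordings(event, p):
    """The two equivalent ways of writing the same belief."""
    return [(event, p), (f"not {event}", 1 - p)]

def canonical_wordings(event, p):
    """Wordings allowed by the 'state the >= 50% side' convention."""
    return [(e, q) for (e, q) in wordings(event, p) if q >= 0.5]

def marked_correct(written_event, event_happened):
    """A calibration row is marked correct iff the written statement came true."""
    negated = written_event.startswith("not ")
    return event_happened != negated

# At 60% there is a single canonical wording, so the tally is unambiguous:
print(canonical_wordings("Biden wins", 0.6))   # [('Biden wins', 0.6)]

# At 50% both wordings are canonical, and they get opposite marks for the
# same outcome, so the 50% column depends on an arbitrary wording choice:
for written, q in canonical_wordings("Biden wins", 0.5):
    print(written, q, marked_correct(written, event_happened=True))
# Biden wins 0.5 True
# not Biden wins 0.5 False
```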
If we compare two hypotheses:
1. Perfect calibration at 50%, vs.
2. Unknown actual calibration (uniform prior across [0,1])
Then the Bayes factor is 2:1 in favour of the former hypothesis (for 7⁄11 correct), so it seems that Scott isn’t throwing away information. Looking across other years supports this: his total of 30 out of 65 correct is 5:1 evidence in favour of the former hypothesis.
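A minimal sketch of that calculation, assuming the data are treated as a binomial count and “uniform prior across [0,1]” means a Beta(1,1) prior on the hit rate. This is my reconstruction of the comparison described above, not the commenter’s actual code:

```python
# Hypothesis H1: every 50% prediction is exactly calibrated (p = 0.5).
# Hypothesis H2: the true hit rate p is unknown, uniform prior on [0, 1].
from math import comb

def bayes_factor_50(k, n):
    """Bayes factor for H1 (p = 0.5) over H2 (uniform prior on p),
    given k correct out of n predictions."""
    # Likelihood of the observed count under H1: binomial with p = 0.5.
    like_h1 = comb(n, k) * 0.5 ** n
    # Marginal likelihood under H2: the binomial likelihood integrated
    # against a uniform prior, which works out to exactly 1 / (n + 1).
    like_h2 = 1 / (n + 1)
    return like_h1 / like_h2

print(bayes_factor_50(7, 11))   # ~1.9, i.e. roughly 2:1 for H1
print(bayes_factor_50(30, 65))  # ~5.4, i.e. roughly 5:1 for H1
```

The uniform-prior marginal comes out to 1/(n+1) for any k, which is what makes the comparison a one-liner.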
So I was mostly averaging percentages across the years instead of counting, which isn’t great; knowing that it’s 30⁄65 makes me much more on board with “oh yeah there’s no signal there.”
But I think your comparison between hypotheses seems wrong; like, presumably it should be closer to a BIC-style test, where you decide if it’s worth storing the extra parameter p?
The Bayes factor calculation I did is the analytical result that BIC approximates (see this sequence). Generally BIC is a large-N approximation, but in this case the two end up being fairly similar even with low N.
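As a rough check on that claim, here is a sketch comparing the exact Bayes factor above with a Schwarz/BIC-style approximation that compares maximised log-likelihoods and charges the unknown-calibration hypothesis (1/2)·log(n) for its one free parameter. This is my own illustration under the same binomial setup as the block above, not the commenter’s calculation:

```python
from math import comb, log, exp

def log_bf_exact(k, n):
    # Exact: binomial likelihood at p = 0.5 versus the beta-binomial
    # marginal under a uniform prior, which equals 1 / (n + 1).
    return log(comb(n, k) * 0.5 ** n) - log(1 / (n + 1))

def log_bf_bic(k, n):
    # BIC-style: difference of maximised log-likelihoods, plus a
    # (1/2) * log(n) penalty on H2 for fitting p freely at its MLE k / n.
    p_hat = k / n
    ll_h1 = n * log(0.5)
    ll_h2 = k * log(p_hat) + (n - k) * log(1 - p_hat)
    return (ll_h1 - ll_h2) + 0.5 * log(n)

for k, n in [(7, 11), (30, 65)]:
    print(k, n, exp(log_bf_exact(k, n)), exp(log_bf_bic(k, n)))
# (7, 11):  exact ~1.9 vs BIC-style ~2.2
# (30, 65): exact ~5.4 vs BIC-style ~6.6
```

Even at n = 11 the two land within a factor of about 1.1 of each other, consistent with the point that the approximation holds up here despite the small sample.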