I tried the calibration exercise you linked. Skipped one question where I felt I just had no basis at all for answering, but answered all the rest, even when I felt very unsure.
When I said 95% confident, my accuracy was 100% (9/9)
When I said 85% confident, my accuracy was 83% (5/6)
When I said 75% confident, my accuracy was 71% (5/7)
When I said 65% confident, my accuracy was 60% (3/5)
At a glance, that looks like it’s within rounding error of perfect. So I was feeling pretty good about my calibration, until...
When I said 55% confident, my accuracy was 92% (11/12)
I, er, uh...what? How can I be well-calibrated at every other confidence level and then get over 90% right when I think I’m basically guessing?
Null Hypothesis: Random fluke? A quick mental calculation says the chance of getting at least 11 out of 12 coin flips right is p < .01. Plus, this is a larger sample than any other confidence level, so if I’m not going to believe this, I probably shouldn’t believe any of the other results, either.
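For anyone who wants to check that mental math, here's a minimal sketch of the exact binomial tail calculation, assuming the null hypothesis that each 55%-confidence answer was a fair 50/50 guess:

```python
from math import comb

# Under the null of pure guessing, each of the 12 answers is a fair coin flip.
# P(at least 11 correct out of 12) = [C(12,11) + C(12,12)] / 2^12
p_value = sum(comb(12, k) for k in (11, 12)) / 2**12
print(p_value)  # 13/4096 ≈ 0.0032, comfortably below .01
```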
(Of course, from your perspective, I’m the one person out of who-knows-how-many test takers that got a weird result and self-selected to write a post about it. But from my perspective it seems pretty surprising.)
Hypothesis #1: There are certain subject areas where I feel like I know stuff, and other subject areas where I feel like I don’t know stuff, and I’m slightly over-confident in the former but drastically under-confident in the latter.
This seems likely true to some extent—I gave much less confidence overall in the “country populations” test section, but my actual accuracy there was about the same as other categories. But I also said 55% twice in each of the other 3 test sections (and got all 6 of those correct), so it seems hard to draw a natural subject-area boundary that would fully explain the results.
Hypothesis #2: When I believe I don’t have any “real” knowledge, I switch mental gears to using a set of heuristics that turns out to be weirdly effective, at least on this particular test. (Maybe the test is constructed with some subtle form of bias that I’m subconsciously exploiting, but only in this mental mode?)
For example, on one question where the test asked if country X or Y had a higher population in 2019, I gave a correct, 55% confident answer on the basis of “I vaguely feel like I hear about country X a little more often than country Y, and high population seems like it would make me more likely to hear about a country, so I suppose that’s a tiny shred of Bayesian evidence for X.”
I have a hard time believing heuristics like that are 90% accurate, though.
Other hypotheses?
Possibly relevant: I also once tried playing CFAR’s calibration game, and after 30-something binary questions in that game, I had around 40% overall accuracy (i.e. worse than random chance). I think that was probably bad luck rather than actual anti-knowledge, but I concluded that I can’t use that game due to lack of relevant knowledge.
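As a rough sanity check on the "bad luck" interpretation: taking about 12 correct out of 30 as an illustrative stand-in for "around 40% on 30-something questions" (those exact figures are an assumption, not the real tallies), the probability of doing that badly or worse by pure chance is still substantial:

```python
from math import comb

# Illustrative numbers only: ~12 correct out of 30 binary questions (≈40%).
# P(at most 12 correct) under pure guessing with a fair coin:
p = sum(comb(30, k) for k in range(13)) / 2**30
print(p)  # ≈ 0.18, so a result this bad is not very surprising under chance
```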
I somehow missed all notifications of your reply and just stumbled upon it by chance when sharing this post with someone.
I had something very similar with my calibration results, only it was for 65% estimates:
I think your hypotheses 1 and 2 match with my intuitions about why this pattern emerges on a test like this. Personally, I feel like a combination of 1 and 2 is responsible for my “blip” at 65%.
I’m also systematically under-confident here — that’s because I cut my prediction teeth getting black swanned during 2020, so I tend to leave considerable room for tail events (which aren’t captured in this test). I’m not upset about that, as I think it makes for better calibration “in the wild.”