every single judge thought themselves decently able to discern genuine writing from fakery. The numbers suggest that every single judge was wrong.
I think the first of these claims is a little too pessimistic, and the second may be too.
Here are some comments made by one of the judges (full disclosure: it was me) at the time. “I found these very difficult [...] I had much the same problem [sc. that pretty much every entry felt >50% credible]. [...] almost all my estimates were 40%-60% [...] I fear that this one [...] is just too difficult.” I’m pretty sure (though of course memory is deceptive) that I would not have said that I thought myself “decently able to discern genuine writing from fakery”. (“Almost all” was too strong, though, if I’ve correctly guessed which row in the table is mine. Four of my estimates were 70%. One was 99% but that’s OK because that was my own entry, which I recognized. The others were all 40-60%. Incidentally, I got two of my four 70% guesses right and two wrong, and four of my eight 40%/60% guesses right and four wrong.)
On the second, I remark that judge 14 (full disclosure: this was definitely not me) scored better than +450 and got only two of the 13 entries wrong. The probability of any given judge getting 11/13 or better by chance is about 1%. [EDITED to add: As Douglas_Knight points out, it would be better to say 10/12 because judge 14 guessed 50% for one entry.] In a sample of 53 people you’ll get someone doing this well just by chance roughly half the time. But wait, the two wrong ones were both 60/40 judgements, and judge 14 had a bunch of 70s and 80s and one 90 as well, all of them correct. With judge 14’s probability assignments and random actual results, simulation (I’m too lazy to do it analytically) says that as good a logarithmic score happens only about 0.3% of the time. To figure out exactly what that says about the overall results we’d need some kind of probabilistic model for how people assign their probabilities, and I’m way too lazy for that, but my feeling is that judge 14’s results are good enough to suggest genuinely better-than-chance performance.
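For anyone who wants to check the first calculation, here is a minimal sketch. It assumes every call is an independent fair coin flip and that the 53 judges are independent of each other; the function names are mine, not anything from the original post.

```python
from math import comb

def tail_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): at least k of n calls right."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def any_of(m, q):
    """P(at least one of m independent judges hits an event of probability q)."""
    return 1 - (1 - q)**m

# Assumption: pure guessing, i.e. each call is right with probability 0.5.
p_11_of_13 = tail_at_least(11, 13)  # counting all 13 entries
p_10_of_12 = tail_at_least(10, 12)  # dropping the entry judged at 50%

for label, q in [("11/13 or better", p_11_of_13), ("10/12 or better", p_10_of_12)]:
    print(f"{label}: per judge {q:.4f}, at least one of 53 judges {any_of(53, q):.2f}")
```

Under these assumptions the 11/13 tail is 92/8192, a bit over 1%, and the 10/12 tail is 79/4096, a bit under 2%. Which of the two you use moves the "someone among 53 judges" figure to either side of one half.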
If anyone wants to own up to being judge 14, I’d be extremely interested to hear what they have to say about their mental processes while judging.
As Douglas_Knight points out, it’s only 10/12, a probability of ~0.016. In a sample of ~50 we should see about one person at that level of accuracy or inaccuracy, which is exactly what we see. I’m no more inclined to give #14 a medal than I am to call #43 a dunce. See the histogram I stuck on to the end of the post for more intuition about why I see these extreme results as normal.
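The arithmetic behind that "about one person" can be sketched as follows. This assumes fair-coin guessing on 12 entries and reads "~0.016" as the chance of exactly 10 right; `N_JUDGES = 50` is the rough sample size used in this comment, and the variable names are illustrative only.

```python
from math import comb

N_JUDGES = 50  # rough sample size used in the comment above

def pmf(k, n, p=0.5):
    """P(X == k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

exactly_10 = pmf(10, 12)                              # 66/4096, the ~0.016 figure
at_least_10 = sum(pmf(k, 12) for k in range(10, 13))  # 79/4096

# Expected number of judges at that level, under each reading.
expected_exact = N_JUDGES * exactly_10
expected_tail = N_JUDGES * at_least_10
# By symmetry of the fair coin, the equally bad scores (2 or fewer right) double it.
expected_either_end = 2 * expected_tail

print(f"exactly 10/12: {exactly_10:.4f}, expected judges {expected_exact:.2f}")
print(f"10/12 or better: {at_least_10:.4f}, expected judges {expected_tail:.2f}")
print(f"either extreme: expected judges {expected_either_end:.2f}")
```

Every reading lands somewhere between roughly one and two judges out of fifty, which is the point being made: a single very good and a single very bad row are unremarkable in a sample this size.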
I absolutely will fess up to exaggerating in that sentence for the sake of dramatic effect. Some judges, such as yourself, were MUCH less wrong. I hope you don’t mind me outing you as one of the people who got a positive score, and that’s a reflection of you being better calibrated. That said, if you say “I’m 70% confident” four times, and only get it right twice, that’s evidence that you were still (slightly) overconfident when you thought “decently able to discern genuine writing from fakery”.
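How strong is that evidence? A quick sketch, assuming the four 70% calls are independent and the judge really is right 70% of the time on such calls (the helper name is made up for illustration):

```python
from math import comb

def prob_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Chance that a perfectly calibrated judge gets 2 or fewer of four 70% calls right.
p_two_or_fewer = prob_at_most(2, 4, 0.7)
print(f"P(at most 2 of 4 right | true hit rate 0.7) = {p_two_or_fewer:.4f}")
```

A perfectly calibrated judge ends up with two or fewer hits out of four roughly 35% of the time, so "slightly" is the right word: four calls cannot say much about calibration either way.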
I’m #43 and I’ll accept my dunce cap. I responded just after I began lurking here. I remember having little confidence in my responses and yet I apparently answered as if I did. I really have no insight into why I answered this way. My cringeworthy results reinforce to me the importance of sticking around and improving my thinking.
that’s a reflection of you being better calibrated
Or, of course, just lucky. If you aren’t giving #14 a medal, you shouldn’t be giving me one either. (Though, as it happens, I have some reason to think my calibration is pretty good.) And yes, I was still slightly overconfident, and my intention in what I wrote above was to make it clear that I recognize that.
The judge in row 14 did not get 11/13, but 10/12, having punted on #8 by assigning 50%. This affects at least your first calculation.
Good catch. But it’s the second calculation that I find more interesting.
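For reference, the second calculation (how often random outcomes would give as good a logarithmic score for a fixed set of stated probabilities) can be done exactly rather than by simulation, since 13 entries give only 8192 right/wrong patterns. The sketch below swaps in that exhaustive enumeration. The confidence vector is a placeholder chosen to fit the description in the thread (one 90, some 70s and 80s, two wrong 60/40 calls, one 50); it is not copied from the results table, and the exact split is my assumption. The score here is the plain sum of natural logs; any positive rescaling or constant offset of it gives the same tail probability.

```python
from itertools import product
from math import log

def log_score(confidences, correct):
    """Sum of ln(probability the judge assigned to what actually happened)."""
    return sum(log(c if ok else 1 - c) for c, ok in zip(confidences, correct))

def chance_tail(confidences, observed_correct, p_right=0.5):
    """Exact P(score >= observed score) when each call is independently right
    with probability p_right, by enumerating every right/wrong pattern."""
    observed = log_score(confidences, observed_correct)
    total = 0.0
    for pattern in product((True, False), repeat=len(confidences)):
        # Small tolerance so patterns that tie the observed score are counted.
        if log_score(confidences, pattern) >= observed - 1e-12:
            k = sum(pattern)
            total += p_right**k * (1 - p_right)**(len(pattern) - k)
    return total

# HYPOTHETICAL confidences: stated probability for the option the judge favoured.
confidences = [0.9] + [0.8] * 3 + [0.7] * 4 + [0.6] * 4 + [0.5]
# Observed pattern under the same assumption: only two of the 60/40 calls wrong.
observed = [True] * 8 + [True, True, False, False] + [True]

tail = chance_tail(confidences, observed)
print(f"P(log score at least this good by chance) = {tail:.4f}")
```

With this placeholder vector the enumeration gives 22/8192, about 0.27%, in the same range as the simulated figure quoted earlier. A different split of 60s, 70s and 80s changes the number, so the real row from the table is needed for a real answer.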
There is also a fair chance that that judge recognized at least one of their own entries… 9/11?