What [C] is referring to is a technique called Bonferroni correction, which statisticians have long used to guard against “fishing expeditions” in which a scientist tries out a zillion different post hoc correlations, with no clear a priori hypothesis, and reports the one random thing that sorta vaguely looks like it might be happening and makes a big deal of it, ignoring a whole bunch of other similar hypotheses that failed. (XKCD has a great cartoon about that sort of situation.)
But that’s not what is going on here, and as one recent review put it, Bonferroni should not be applied “routinely”. It makes sense to use it when there are many uncorrelated tests and no clear prior hypothesis, as in the XKCD cartoon. But here there is an obvious a priori test: does using an LLM make people more accurate? That’s what the whole paper is about. You don’t need a Bonferroni correction for that, and shouldn’t be using one. Deliberately or not (my guess is not), OpenAI has misanalyzed their data1 in a way that underreports the potential risk. As a statistician friend put it, “if somebody was just doing stats robotically, they might do it this way, but it is the wrong test for what we actually care about”.
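To make the distinction concrete, here is a small illustrative sketch in Python (simulated noise and made-up numbers, not the paper’s data): running many post hoc comparisons on pure noise will occasionally produce something that looks “significant”, which is the fishing-expedition scenario Bonferroni is designed for, whereas a single pre-specified hypothesis is judged against the ordinary threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, alpha = 20, 0.05

# Twenty post hoc comparisons on data with no real effect at all
# (the "jelly bean" scenario the XKCD cartoon lampoons).
p_values = [
    stats.ttest_ind(rng.normal(size=25), rng.normal(size=25)).pvalue
    for _ in range(n_tests)
]

uncorrected_hits = sum(p < alpha for p in p_values)
bonferroni_hits = sum(p < alpha / n_tests for p in p_values)
print(f"'Significant' results on pure noise, uncorrected: {uncorrected_hits}")
print(f"After Bonferroni (threshold {alpha / n_tests:.4f}): {bonferroni_hits}")

# Bonferroni exists to tame exactly this kind of fishing expedition.
# A single pre-specified question ("does LLM access improve accuracy?")
# is one test, judged against the ordinary alpha = 0.05.
```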
In fact, if you simply collapsed all the measurements of accuracy and did the single most obvious test here, a simple t-test, the results would (as Footnote C implies) be significant. A more sophisticated test would be an ANCOVA, which, as another knowledgeable academic friend with statistical expertise put it after reading a draft of this essay, “would almost certainly support your point that an omnibus measure of AI boost (from a weighted sum of the five dependent variables) would show a massively significant main effect, given that 9 out of the 10 pairwise comparisons were in the same direction.”
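As a rough sketch of what “collapse and test once” could look like (the numbers below are hypothetical placeholders, not the study’s data, and the composite here is a plain average rather than the weighted sum my friend describes):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical composite accuracy per participant: each person's five
# accuracy measures averaged into a single score (placeholder values).
no_llm = rng.normal(loc=0.50, scale=0.10, size=25)    # control group
with_llm = rng.normal(loc=0.60, scale=0.10, size=25)  # LLM-access group

t_stat, p_value = stats.ttest_ind(with_llm, no_llm)
print(f"Single omnibus t-test: t = {t_stat:.2f}, p = {p_value:.4f}")

# This is one pre-specified test on the collapsed measure, so no
# Bonferroni correction applies; an ANCOVA would go further and adjust
# the same comparison for baseline covariates.
```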
Also, there was likely an effect, but sample sizes were too small to detect this:
There were 50 experts; 25 with LLM access, 25 without. From the reprinted table we can see that 1 in 25 (4%) experts without LLMs succeeded in the formulation task, whereas 4 in 25 with LLM access succeeded (16%).
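To illustrate the sample-size point, here is a back-of-the-envelope check using only the 1-in-25 vs. 4-in-25 counts quoted above. It assumes scipy and statsmodels are available, uses a conventional 80% power target, and treats the quoted proportions as the true effect, so it is a sketch rather than a reanalysis of the study.

```python
from scipy.stats import fisher_exact
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Rows: LLM access vs. no LLM; columns: succeeded vs. failed (from the
# counts quoted above: 4/25 with LLMs, 1/25 without).
table = [[4, 21], [1, 24]]
_, p_value = fisher_exact(table)
print(f"Fisher's exact test on the formulation task: p = {p_value:.3f}")

# Per-group sample size needed to detect a true 4% vs. 16% difference
# with 80% power at alpha = 0.05 (two-sided).
effect = proportion_effectsize(0.16, 0.04)
n_needed = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.80)
print(f"Participants needed per group: about {n_needed:.0f}")
# The exact test comes out nowhere near significant, and the required
# per-group n is well above the 25 experts actually available.
```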
Gary Marcus has criticised the results here:
> Also, there was likely an effect, but sample sizes were too small to detect this:
Thanks for pointing this out! I’ve added a note about it to the main post.