The transformer was such an advance that it made the community create a new benchmark, “SuperGLUE,” because the previous gold standard benchmark (GLUE) was now too easy.
GPT-3 is so little of an advance, it doesn’t even do that well at SuperGLUE. It just does okay with its dominant hand tied behind its back.
This was my biggest take-away. Not having read the GPT-3 paper, and not having heard of SuperGLUE (but having read Gwern’s glowing review of GPT-3), I fully expected that GPT-3 few-shot learning would beat state-of-the-art on a benchmark like this!
To be fair, it’s not an apples-to-apples comparison.
GPT-3 few-shot learning gets to use less data. (Although much of SuperGLUE has tiny train sets, so this gap isn’t as big as it sounds.) And with GPT-3 you don’t have the storage overhead of a separate trained model for every task.
Back when I wrote this post, I really did not realize that OpenAI was serious about few-shot learning as a practical, competitive approach. I had assumed it was meant as a conceptual demonstration of meta-learning, or a new way to probe what LMs “know.”
In other words, I implicitly assumed “oh, of course they aren’t planning [something like the OpenAI API], it’d be uncharitable to assume they actually think this is a practical approach.” Now it’s clear that they do think that, which makes for a different conversation than the one I had expected here. (I’m still bearish on the approach, though.)
Update: It seems that GPT-3 can actually do quite well (maybe SOTA? Human-level-ish, it seems) at SuperGLUE with the right prompt (which I suppose you could call a kind of fine-tuning, but it’s importantly different from what everyone meant by fine-tuning at the time this article was written!). What do you think of this?
This is also a reply to your passage in the OP, quoted above.
I’m confused—the paper you link is not about better prompts for GPT-3. It’s about a novel fine-tuning methodology for T5. GPT-3 only appears in the paper as a reference/baseline to which the new method is compared.
The use of a BERT / T5-style model (denoising loss + unmasked attn) is noteworthy because these models reliably outperform GPT-style models (LM loss + causally masked attn) in supervised settings.
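To make that contrast concrete, here is a minimal sketch (purely illustrative, not either family’s actual implementation) of the two attention patterns and training objectives:

```python
import numpy as np

# GPT-style ("LM loss + causally masked attn"): each position attends only to itself
# and earlier positions, and training predicts the next token from the prefix.
# BERT/T5-style ("denoising loss + unmasked attn"): every position attends to every
# other position, and training reconstructs masked-out / corrupted spans of the input.

seq_len = 6

# True = attention allowed at that (query, key) pair.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # lower triangle only
full_mask = np.ones((seq_len, seq_len), dtype=bool)             # no masking at all

print(causal_mask.astype(int))  # GPT-style pattern
print(full_mask.astype(int))    # BERT/T5-style pattern
```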
Because of this, I sometimes refer to GPT-3 as “quantifying the cost (in additional scale) imposed by choosing a GPT-style model.” That is, the following should be roughly competitive w/ each other:
BERT/T5 at param count N
GPT at param count ~100 * N
See my comments near the bottom here.
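As a hypothetical worked example of that heuristic (the ~100x factor is a rough ballpark, not a measured constant):

```python
def equivalent_gpt_params(bert_t5_params: float, cost_factor: float = 100.0) -> float:
    """GPT-style param count that should be roughly competitive with a
    BERT/T5-style model of the given size, under the rough heuristic above."""
    return cost_factor * bert_t5_params

# e.g. a 1B-parameter T5-style model would be matched by a ~100B-parameter GPT-style model
print(f"{equivalent_gpt_params(1e9):.0e}")  # 1e+11
```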
Separately, I am aware that people have gotten much better performance out of GPT-3 by putting some effort into prompt design, vs. the original paper which put basically no effort into prompt design.
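For concreteness, here is a rough sketch of what assembling a few-shot prompt for a SuperGLUE-style yes/no task could look like; the template and examples are made up for illustration and are not the format used in the GPT-3 paper or in the later prompt-design work:

```python
# Illustrative few-shot prompt construction for a BoolQ-like yes/no task.
def build_few_shot_prompt(train_examples, test_passage, test_question):
    """Concatenate k labeled examples followed by the unlabeled test item."""
    parts = []
    for passage, question, answer in train_examples:
        parts.append(f"Passage: {passage}\nQuestion: {question}\nAnswer: {answer}\n")
    parts.append(f"Passage: {test_passage}\nQuestion: {test_question}\nAnswer:")
    return "\n".join(parts)

examples = [
    ("Water boils at 100 degrees Celsius at sea level.",
     "Does water boil at 100 C at sea level?", "yes"),
    ("The Great Wall is in China.",
     "Is the Great Wall located in Brazil?", "no"),
]
prompt = build_few_shot_prompt(
    examples,
    "Mount Everest is the tallest mountain above sea level.",
    "Is Mount Everest the tallest mountain above sea level?",
)
print(prompt)  # this string is sent to the LM, which is expected to continue with "yes" or "no"
```

The prompt-design work I am referring to mostly varies this kind of wording, formatting, and choice of in-context examples, rather than updating any model weights.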
Your comment claims that the “SOTA” within that line of work is close to the overall SOTA on SuperGLUE—which I would readily believe, since GPT-3 was already pretty competitive in the paper and dramatic effects have been reported for prompt design on specific tasks. However, I’d need to see a reference that actually establishes this.
Ah! You are right, I misread the graph. *embarrassed* Thanks for the correction!