The transformer was such an advance that it made the community create a new benchmark, “SuperGLUE,” because the previous gold-standard benchmark (GLUE) was now too easy.
GPT-3 is so little of an advance that it doesn’t even do that well at SuperGLUE. It just does okay with its dominant hand tied behind its back.
Update: It seems that GPT-3 can actually do quite well at SuperGLUE with the right prompt (maybe SOTA? Roughly human-level, it seems). I suppose you could call prompt design a kind of fine-tuning, but it’s importantly different from what everyone meant by fine-tuning at the time this article was written! What do you think of this?
(This is also a reply to your passage in the OP, quoted at the top.)
I’m confused—the paper you link is not about better prompts for GPT-3. It’s about a novel fine-tuning methodology for T5. GPT-3 only appears in the paper as a reference/baseline to which the new method is compared.
The use of a BERT / T5-style model (denoising loss + unmasked attn) is noteworthy because these models reliably outperform GPT-style models (LM loss + causally masked attn) in supervised settings.
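To make that distinction concrete, here is a minimal sketch of the two attention masks and training objectives (my own illustration in PyTorch, not code from either paper):

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of the two regimes; not code from either paper.

def causal_mask(seq_len: int) -> torch.Tensor:
    """GPT-style: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def full_mask(seq_len: int) -> torch.Tensor:
    """BERT/T5-encoder-style: every position attends to every position."""
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

def lm_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """GPT-style LM objective: predict token t+1 from tokens 1..t."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )

def denoising_loss(logits: torch.Tensor, tokens: torch.Tensor,
                   masked: torch.Tensor) -> torch.Tensor:
    """BERT-style denoising objective: predict only the corrupted
    positions, with full bidirectional context available everywhere."""
    return F.cross_entropy(logits[masked], tokens[masked])
```

The upshot is that the GPT-style mask forgoes the right-hand context that BERT/T5-style models get to condition on, which is plausibly why the latter win when fine-tuned on a supervised task.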
Because of this, I sometimes refer to GPT-3 as “quantifying the cost (in additional scale) imposed by choosing a GPT-style model.” That is, the following should be roughly competitive w/ each other:
BERT/T5 at param count N
GPT at param count ~100 * N
See my comments near the bottom here.
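As a rough sanity check on that ~100x figure (my own back-of-the-envelope arithmetic using published parameter counts, not a number from either paper):

```python
# Rough sanity check of the ~100x rule of thumb above, using
# published parameter counts (my own arithmetic, not from the papers).
t5_params = 11e9     # T5-11B, the largest T5
gpt3_params = 175e9  # GPT-3

ratio = gpt3_params / t5_params
print(f"GPT-3 / T5-11B = ~{ratio:.0f}x")  # ~16x, well short of ~100x
```

On that heuristic, GPT-3 at only ~16x the size of T5-11B “should” still trail fine-tuned T5 in supervised settings, which matches the SuperGLUE picture above.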
Separately, I am aware that people have gotten much better performance out of GPT-3 by putting some effort into prompt design, vs. the original paper which put basically no effort into prompt design.
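For concreteness, “putting effort into prompt design” means something like the following hypothetical few-shot prompt for a BoolQ-style yes/no task (my own illustration, not the actual prompt from the GPT-3 paper or any follow-up work):

```python
# Hypothetical few-shot prompt for a BoolQ-style yes/no task.
# Illustration only -- not the prompt used in any particular paper.
few_shot_prompt = """\
Passage: The Amazon is the largest rainforest on Earth.
Question: Is the Amazon the largest rainforest on Earth?
Answer: yes

Passage: Mount Everest is located in the Himalayas.
Question: Is Mount Everest located in the Andes?
Answer: no

Passage: {passage}
Question: {question}
Answer:"""

# At evaluation time, fill in the test instance and compare the model's
# likelihood of " yes" vs. " no" as the continuation.
prompt = few_shot_prompt.format(passage="...", question="...")
```

Small choices of this kind (example selection, phrasing, answer format) have reportedly moved scores substantially on individual tasks.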
Your comment claims that the “SOTA” within that line of work is close to the overall SOTA on SuperGLUE—which I would readily believe, since GPT-3 was already pretty competitive in the paper and dramatic effects have been reported for prompt design on specific tasks. However, I’d need to see a reference that actually establishes this.
Ah! You are right; I misread the graph. *embarrassed* Thanks for the correction!