I’m confused—the paper you link is not about better prompts for GPT-3. It’s about a novel fine-tuning methodology for T5. GPT-3 only appears in the paper as a reference/baseline to which the new method is compared.
The use of a BERT / T5-style model (denoising loss + unmasked attn) is noteworthy because these models reliably outperform GPT-style models (LM loss + causally masked attn) in supervised settings.
Because of this, I sometimes refer to GPT-3 as “quantifying the cost (in additional scale) imposed by choosing a GPT-style model.” That is, the following should be roughly competitive w/ each other:
- BERT/T5 at param count N
- GPT at param count ~100 * N
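To make the architectural distinction concrete, here is a minimal illustrative sketch (plain numpy, not any model's actual implementation) of the two attention-masking schemes mentioned above:

```python
import numpy as np

seq_len = 4

# GPT-style: causal mask. Position i may attend only to positions 0..i,
# which pairs with a left-to-right language-modeling loss.
causal_mask = np.tril(np.ones((seq_len, seq_len)))

# BERT/T5-encoder-style: no causal restriction. Every position attends
# to every other position, which pairs with a denoising objective.
full_mask = np.ones((seq_len, seq_len))

# The causal mask zeroes out exactly the strictly-upper-triangular
# entries, i.e. attention to future positions.
future_positions_blocked = (full_mask - causal_mask) == np.triu(
    np.ones((seq_len, seq_len)), k=1
)
```

The supervised-performance gap discussed above is often attributed to the bidirectional model seeing the full input at every layer, while the causal model must commit to a strictly left-to-right information flow.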
See my comments near the bottom here.
Separately, I am aware that people have gotten much better performance out of GPT-3 by putting some effort into prompt design, vs. the original paper which put basically no effort into prompt design.
Your comment claims that the “SOTA” within that line of work is close to the overall SOTA on SuperGLUE—which I would readily believe, since GPT-3 was already pretty competitive in the paper and dramatic effects have been reported for prompt design on specific tasks. However, I’d need to see a reference that actually establishes this.
Ah! You are right, I misread the graph. *embarrassed* Thanks for the correction!