This post holds up well in hindsight. I still endorse most of the critiques here, and the ones I don’t endorse are relatively unimportant. Insofar as we have new evidence, I think it tends to support the claims here.
In particular:
Framing few-shot learning as “meta-learning” has caused a lot of confusion. This framing made little sense to begin with, for the reasons I note in this post, and there is now some additional evidence against it.
The paper does very little to push the envelope of what is possible in NLP, even though GPT-3 is probably capable of pushing it. The paper spends very little time getting GPT-3 to do new things that were not previously possible; instead it spends most of its time reproducing BERT results. “Our 175B model can do as well as BERT” is an underwhelming punchline, and if anything an inadvertent argument against prompting as a technique.
Both these points are still not appreciated as broadly as I think they ought to be.
I’m not sure how much lasting value this post has. My recent post here covers the same ground more carefully.
I’m not sure if this is relevant, but this post received some very critical comments, leading me to seriously question the value of continuing to write posts like this on LW. See here for a discussion about this with a reader of my blog. I did continue to write posts like this, and they have been well received, even when they reiterated my arguments here. I am curious what explains this difference, and have no good hypotheses.
Some points here I no longer endorse:
I no longer care whether the new model “deserves” the name GPT-3, and I shouldn’t have mixed this inane gripe with serious critiques. (I had forgotten at the time, but when GPT-2 was first announced, I made a similar objection to its name.)
“this is about the least interesting transformer paper one can imagine in 2020” is just wrong, even as hyperbole.
The paper crosses a larger scale gulf than earlier ones, and it’s valuable to know what happens as you scale that much, even if what happens is “nothing especially novel.”
Related: I had a vague impression from other work that “scaled-up transformers are fundamentally like smaller ones.” Here, I acted more confident of that claim than I had reason to be, and also assumed it was an established consensus, which it (clearly!) wasn’t. I still think this claim is true, but it’s a point of contention even today.
I didn’t understand that OpenAI “really meant it” about few-shot results.
That is, I assumed that “obviously” no one would use few-shot prompting as a practical technique, and thought OpenAI was exhibiting these results to illuminate the model’s properties, whereas in fact OpenAI really believes in a future where we interact with LMs entirely through natural language prompting.
The release of the API (some time after this post) blindsided me.
I had the same “obviously this isn’t practical” response to GPT-2 zero-shot results, though when I go back and read the GPT-2 paper, it’s clear the authors “really mean it” there too.
If you feel like “larger language models may disappoint you” was one of the posts that reiterated your arguments here, the two seem to be saying pretty different things to me? It feels like this article is fundamentally focused on talking about the GPT-3 paper, whereas your later post is focused on talking about GPT-3 itself.
The later post still reiterates the main claims from this post, though.
This post: “Few-shot learning results are philosophically confusing and numerically unimpressive; the GPT-3 paper was largely a collection of few-shot learning results, therefore the paper was disappointing”
The later post: “Few-shot learning results are philosophically confusing and numerically unimpressive; therefore we don’t understand GPT-3’s capabilities well and should use more ‘ecological’ methods instead”
Many commenters on this post disagreed with the part that both posts share (“Few-shot learning results are philosophically confusing and numerically unimpressive”).