Perhaps I wasn’t clear—when I cited my experience as an ML practitioner, I did so in support of a claim about whether the stated capabilities of GPT-3 sound useful, not as a point about what those capabilities are.
I don’t think the practical value of very new techniques is impossible to estimate. For example, the value of BERT was very clear in the paper that introduced it: it was obvious that this was a strictly better way to do supervised NLP, and it was quickly and widely adopted.
(I suppose it’s conceivable that few-shot learning with a large model is “secretly useful” in some way not conveyed in the paper, but that’s true of any paper, so if this proves anything then it proves too much.)
A smell test: what do you think your past experience would have predicted about the performance of a 175B-parameter model in advance?
Above I argued this question was orthogonal to my point, but to answer it anyway: I’d certainly predict better performance on LM tasks, as a simple extrapolation of the existing “biggening” research (GPT-2 at 1.5B parameters, Megatron-LM at 8.3B, T5 at 11B, T-NLG at 17B).
For downstream tasks, I’d expect similar scaling: certainly with fine-tuning (given T5’s success on SuperGLUE) though GPT-3 was not fine-tuned, and also with unsupervised approaches (zero-shot, few-shot) given the reported scaling of GPT-2 zero-shot with model size (GPT-2 Fig 1).
I also would have predicted that fine-tuning still out-performs unsupervised approaches by a large margin on most tasks, a gap we observe with unsupervised GPT-3 vs. fine-tuned smaller models (presumably comparing to fine-tuned 175B models would yield an even larger gap).
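For concreteness, the “unsupervised” mode at issue here is just prompting: pack K labeled examples into the context and have the model score candidate continuations, with no gradient updates. A minimal sketch of that querying pattern, using GPT-2 through the HuggingFace transformers API as a stand-in for a far larger model, and an invented toy sentiment task rather than any benchmark from the paper:

```python
# Sketch only: GPT-2 as a stand-in, a made-up sentiment task, no gradient updates.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# K = 3 "shots" packed into the context, followed by the query.
few_shot_examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
    ("A moving, beautifully shot film.", "positive"),
]
query = "The plot made no sense and the acting was worse."

prompt = "".join(f"Review: {text}\nSentiment: {label}\n\n"
                 for text, label in few_shot_examples)
prompt += f"Review: {query}\nSentiment:"

def completion_logprob(prompt, completion):
    """Total log-probability the LM assigns to `completion` given `prompt`.
    (Rough: assumes the prompt/completion token boundary is clean, which it is
    for single-word labels with a leading space.)"""
    prompt_ids = tok.encode(prompt)
    full_ids = tok.encode(prompt + completion)
    with torch.no_grad():
        logits = model(torch.tensor([full_ids])).logits
    logprobs = torch.log_softmax(logits, dim=-1)
    # The logits at position i predict the token at position i + 1.
    return sum(logprobs[0, i - 1, full_ids[i]].item()
               for i in range(len(prompt_ids), len(full_ids)))

scores = {label: completion_logprob(prompt, " " + label)
          for label in ("positive", "negative")}
print(max(scores, key=scores.get), scores)
```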
I alluded to all this in the post, as did the GPT-3 authors in their paper: the results demonstrate that existing trends continue up to 175B. As Daniel Kokotajlo says, the new observation confirms an already familiar, though previously untested, prediction.
I don’t think the practical value of very new techniques is impossible to estimate. For example, the value of BERT was very clear in the paper that introduced it: it was obvious that this was a strictly better way to do supervised NLP, and it was quickly and widely adopted.
This comparison seems disingenuous. The goal of the BERT paper was to introduce a novel training method for Transformer-based models that measurably outperformed previous training methods. Conversely, the goal of the GPT-3 paper seems to be to investigate the performance of an existing training method when scaled up to previously unreached (and unreachable) model sizes. I would expect you to agree that these are two very different things, surely?
More generally, it seems to me that you’ve been consistently conflating the practical usefulness of a result with how informative said result is. Earlier, you wrote that “few-shot LM prediction” (not GPT-3 specifically, few-shot prediction in general!) doesn’t sound that promising to you because the specific model discussed in the paper doesn’t outperform SOTA on all benchmarks, and also requires currently impractical levels of hardware/compute. Setting aside the question of whether this original claim resembles the one you just made in your latest response to me (it doesn’t), neither claim addresses what, in my view, are the primary implications of the GPT-3 paper—namely, what it says about the viability of few-shot prediction as model capacity continues to increase.
This, incidentally, is why I issued the “smell test” described in the grandparent, and your answer more or less confirms what I initially suspected: the paper comes across as unsurprising to you because you largely had no concrete predictions to begin with, beyond the trivial prediction that existing trends will persist to some (unknown) degree. (In particular, I didn’t see anything in what you wrote that indicates an overall view of how far the capabilities current language models are from human reasoning ability, and what that might imply about where model performance might start flattening with increased scaling.)
Since it doesn’t appear that you had any intuitions to begin with about what GPT-3’s results might indicate about the scalability of language models in general, it makes sense that your reading of the paper would be framed in terms of practical applications, of which (quite obviously) there are currently none.
what, in my view, are the primary implications of the GPT-3 paper—namely, what it says about the viability of few-shot prediction as model capacity continues to increase
This seems like one crux of our disagreement. If I thought the paper showed a clear trend, with room to grow, toward much greater few-shot learning performance with even bigger models, I would be more impressed with “few-shot + large LM” as an approach.
I don’t think it shows that. The clearest evidence on this subject, IMO, is the many plots in their Appendix H. On a large fraction of the individual downstream tasks, few-shot learning has either
a scaling trend with a clearly defined shape that is mostly flat by the 175B point, with a remaining gap vs. fine-tuning that seems unlikely to be closed (examples: WiC, MultiRC, ReCoRD, PhysicalQA, OpenBookQA, at least 5 of the 6 reading comprehension tasks, ANLI)
a very noisy trend where, due to noise, returns to scale might be large but might just as well be near zero (examples: BoolQ, CB, WSC)
The scaling trend is more encouraging on certain downstream tasks (COPA, ARC, Winogrande, many of the MT tasks), on “less downstream” tasks that essentially probe language modeling skill in a different way (cloze/completion), and on synthetic tasks.
On average, there is a trend toward slow but steady growth with scale (Fig 1.3), but this masks the great across-task variance catalogued above. The scaling picture for few-shot is very different from the scaling picture for LM loss itself, which as catalogued in another OpenAI paper is remarkably smooth and predictable, and which (as GPT-3 shows) continues smoothly to 175B.
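To illustrate what “smooth and predictable” buys you: LM loss vs. parameter count is well fit by a simple power law, so a handful of smaller models pins down the curve and extrapolates cleanly to the next order of magnitude. A minimal sketch, where the (N, loss) points are made-up placeholders, not numbers from either OpenAI paper:

```python
# Sketch only: hypothetical (parameter count, validation loss) pairs, fit in log-log space.
import numpy as np

n = np.array([1.5e9, 2.7e9, 6.7e9, 13e9])  # made-up model sizes
loss = np.array([3.00, 2.92, 2.80, 2.72])  # made-up LM losses

# A power law L(N) = C * N**(-alpha) is a straight line in log-log coordinates.
slope, intercept = np.polyfit(np.log(n), np.log(loss), 1)
alpha = -slope  # the log-log slope is negative: loss falls as N grows
pred_175b = np.exp(intercept + slope * np.log(175e9))
print(f"fitted exponent ~ {alpha:.3f}; extrapolated loss at 175B params ~ {pred_175b:.2f}")
```

Nothing comparably regular holds for most of the per-task few-shot curves in Appendix H; there is no clean shape to fit and extrapolate.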
I find it difficult to express just what I find unimpressive here without further knowledge of your position. (There is an asymmetry: “there is value in this paper” is a there-exists-an-x claim, while “there is no value in this paper” is a for-all-x claim. I’m not arguing for-all-x, only that I have not seen any x yet.)
All I can do is enumerate and strike out all the “x”s I can think of. Does few-shot learning look promising in the scaling limit?
As a tool for humans: no, I expect fine-tuning will always be preferred.
As a demonstration that transformers are very generic reasoners: no, we still see a wide spread of task performance despite smooth gains in LM loss, with some of the most distinctive deficits persisting at all scales (common sense physics, cf section 5), and some very basic capabilities only emerging at very large scale and noisily even there (arithmetic).
As an AGI component: no. Because few-shot learning on most tasks shows no clear scaling trend toward human level, any role of transformers in AGI will require more effective ways of querying them (such as fine-tuning controlled by another module), or non-transformer models.
I’m not seeing how you distinguish between the following two hypotheses:
GPT-3 exhibits mostly flat scaling at the tasks you mention underneath your first bullet point (WiC, MultiRC, etc.) because its architecture is fundamentally unsuited to those tasks, such that increasing the model capacity will lead to little further improvement.
Even 175B parameters isn’t sufficient to perform well on certain tasks (given a fixed architecture), but increasing the number of parameters will eventually cause performance on said tasks to undergo a large increase (akin to something like a phase change in physics).
It sounds like you’re implicitly taking the first hypothesis as a given (e.g. when you assert that there is a “remaining gap vs. fine-tuning that seems unlikely to be closed”), but I see no reason to give this hypothesis preferential treatment!
In fact, it seems to be precisely the assertion of the paper’s authors that the first hypothesis should not be taken as a given; and the evidence they give to support this assertion is… the multiple downstream tasks for which an apparent “phase change” did in fact occur. Let’s list them out:
BoolQ (apparent flatline between 2.6B and 13B, then a sudden jump in performance at 175B)
CB (essentially noise between 0.4B and 13B, then a sudden jump in performance at 175B)
RTE (essentially noise until 2.6B, then a sudden shift to very regular improvement until 175B)
WSC (essentially noise until 2.6B, then a sudden shift to very regular improvement until 175B)
basic arithmetic (mostly flat until 6.7B, followed by rapid improvement until 175B)
SQuADv2 (apparent flatline at 0.8B, sudden jump at 1.3B followed by approximately constant rate of improvement until 175B)
ANLI round 3 (noise until 13B, sudden jump at 175B)
word-scramble with random insertion (sudden increase in rate of improvement after 6.7B)
Several of the above examples exhibit a substantial amount of noise in their performance graphs, but nonetheless, I feel my point stands. Given this, it seems rather odd for you to be claiming that the “great across-task variance” indicates a lack of general reasoning capability when said across-task variance is (if anything) evidence for the opposite, with many tasks that previously stumped smaller models being overcome by GPT-3.
It’s especially interesting to me that you would write the following, seemingly without realizing the obvious implication (emphasis mine):
we still see a wide spread of task performance despite smooth gains in LM loss, with some of the most distinctive deficits persisting at all scales (common sense physics, cf section 5), and some very basic capabilities only emerging at very large scale and noisily even there (arithmetic)
The takeaway here is, at least in my mind, quite clear: it’s a mistake to evaluate model performance on human terms. Without getting into an extended discussion on whether arithmetic ought to count as a “simple” or “natural” task, empirically transformers do not exhibit a strong affinity for the task. Therefore, the fact that this “basic capability” emerges at all is, or at least should be, strong evidence for generalization capability. As such, the way you use this fact to argue otherwise (both in the section I just quoted and in your original post) seems to me to be exactly backwards.
Elsewhere, you write:
The ability to get better downstream results is utterly unsurprising: it would be very surprising if language prediction grew steadily toward perfection without a corresponding trend toward good performance on NLP benchmarks
It’s surprising to me that you would write this while also claiming that few-shot prediction seems unlikely to close the gap to fine-tuned models on certain tasks. I can’t think of a coherent model where both of these claims are simultaneously true; if you have one, I’d certainly be interested in hearing what it is.
More generally, this is (again) why I stress the importance of concrete predictions. You call it “utterly unsurprising” that a 175B-param model would outperform smaller ones on NLP benchmarks, and yet neither you nor anyone else could have predicted what the scaling curves for those benchmarks would look like. (Indeed, your entire original post can be read as an expression of surprise at the lack of impressiveness of GPT-3’s performance on certain benchmarks.)
When you only ever look at things in hindsight, without ever setting forth concrete predictions that can be overturned by evidence, you run the risk of never forming a model concrete enough to be engaged with. I don’t believe it’s a coincidence that you called it “difficult” to explain why you found the paper unimpressive: it’s because your standards of impressiveness are opaque enough that they don’t, in and of themselves, constitute a model of how transformers might/might not possess general reasoning ability.
On the reading of the graphs:
All I can say is “I read them differently and I don’t think further discussion of the ‘right’ way to read them would be productive.”
Something that might make my perspective clear:
when I first read this comment, I thought “whoa, that ‘phase change’ point seems fair and important, maybe I just wasn’t looking for that in the graphs”
and then I went back and looked at the graphs and thought “oh, no, that’s obviously not distinguishable from noise; that’s the kind of non-monotonic bouncing around that I expect when you need more data per plotted point to get a reasonable estimate; that SQuADv2 graph looks like the other 5 reading comp graphs except with more noise,” etc. etc. (see the quick error-bar sketch below)
I don’t expect this will convince you I’m right, but the distance here seems more about generic “how to interpret plots in papers” stuff than anything interesting about GPT-3.
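The “noise” point is mostly just sampling error: with only a few dozen to a few hundred eval examples behind each plotted point, accuracy estimates wobble by several points on their own. A back-of-the-envelope sketch, with round illustrative eval-set sizes:

```python
# Sketch only: binomial standard error of an accuracy estimate from n_eval examples.
import math

def accuracy_stderr(p, n_eval):
    return math.sqrt(p * (1 - p) / n_eval)

for n_eval in (50, 250, 1000, 10000):
    half_width = 1.96 * accuracy_stderr(0.7, n_eval) * 100
    print(f"n_eval={n_eval:>6}: accuracy 70% +/- {half_width:.1f} points (95% CI half-width)")
```

A few points of wobble per plotted point is exactly the scale of the “jumps” in several of the noisier Appendix H plots.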
On this:
I can’t think of a coherent model where both of these claims are simultaneously true; if you have one, I’d certainly be interested in hearing what it is.
Roughly, my position is that transformer LMs are very impressive and know all sorts of things, even at small scale, although they know them “less noisily” as the scale grows.
The intended connotation of my stance that “fine-tuning will outperform few-shot” is not “haha, transformers are limited, they will never stand on their own without supervised training, boo transformers!” If anything, it’s the opposite:
I think transformers have some limits (e.g. physical / spatial stuff). But, already at the 1.5B scale if not before, they display a very-real-if-noisy understanding of the linguistic phenomena probed by most NLP benchmarks.
I think fine-tuning has shown itself to be a remarkably effective way to “get at” this knowledge for downstream tasks—even with small data sets, not far in scale from the “data sets” used in few-shot.
So, I don’t understand what few-shot gets us in terms of ways to probe transformer understanding (we already had a great one) or as a demo of language understanding (what I see in my own generation experiments, at a scale two orders of magnitude smaller, impresses me far more than the few-shot results).
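By “get at this knowledge” I just mean ordinary small-data fine-tuning. A rough sketch, assuming the HuggingFace transformers/datasets APIs, with an arbitrary base model, a 256-example training subset, and round hyperparameters chosen purely for illustration:

```python
# Sketch only: fine-tune a small pretrained model on a few hundred labeled examples.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

raw = load_dataset("super_glue", "boolq")
small_train = raw["train"].shuffle(seed=0).select(range(256))  # a "small data set"

def encode(batch):
    return tokenizer(batch["question"], batch["passage"],
                     truncation=True, padding="max_length", max_length=256)

small_train = small_train.map(encode, batched=True)
val = raw["validation"].map(encode, batched=True)

args = TrainingArguments(output_dir="boolq-small", num_train_epochs=3,
                         per_device_train_batch_size=8, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=small_train, eval_dataset=val)
trainer.train()
print(trainer.evaluate())
```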
Again, I engage with this stuff foremost as someone who is very impressed by transformer LMs as text generators and has interacted with them a lot in that modality.
So, this all feels a bit like being a dog owner who reads a new paper “demonstrating dogs’ capacity for empathy with humans,” is unimpressed with its methodology, and finds themselves arguing over what concrete model of “dog empathy” they hold and what it predicts for the currently popular “dog empathy” proxy metrics, with a background assumption that they’re some sort of dog-empathy-skeptic.
When in fact—they believe that of course their dog empathizes with them, and they find the methodology of the paper awkwardly under-equipped to explore this complex, and very clearly real, phenomenon.
I’ve already seen GPT-2 display vast declarative knowledge and use words in subtle context-dependent ways, and pick up the many-faceted nuances implied in a prompt, and all those things. When I see it again, but with ~100x parameters, and in a contrived experimental setting where ~1.5B models technically fare poorly even if I’ve seen them do that kind of thing in real life . . . should I be impressed?
a scaling trend with a clearly defined shape that is mostly flat by the 175B point, with a remaining gap vs. fine-tuning that seems unlikely to be closed (examples: WiC
Matt Brockman has closed half the gap for WiC by prompt programming, without any finetuning: http://gptprompts.wikidot.com/linguistics:word-in-context
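These aren’t the prompts from that page, just a rough illustration of the kind of “prompt programming” involved: WiC asks whether a target word carries the same sense in two sentences, and the prompt recasts that as a question the LM can answer directly, read off by comparing its scores for “ yes” vs. “ no”:

```python
# Sketch only: an illustrative WiC-style prompt, not the prompts from the linked page.
word = "bank"
sentence_1 = "She sat on the bank of the river and watched the water."
sentence_2 = "He deposited the check at the bank on Tuesday."

prompt = (
    f"Sentence 1: {sentence_1}\n"
    f"Sentence 2: {sentence_2}\n"
    f'Question: Is the word "{word}" used with the same meaning in both sentences? '
    "Answer yes or no.\n"
    "Answer:"
)
print(prompt)  # feed this to the LM and compare its scores for " yes" vs. " no"
```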