I’m not seeing how you distinguish between the following two hypotheses:
1. GPT-3 exhibits mostly flat scaling at the tasks you mention underneath your first bullet point (WiC, MultiRC, etc.) because its architecture is fundamentally unsuited to those tasks, such that increasing the model capacity will lead to little further improvement.
2. Even 175B parameters isn’t sufficient to perform well on certain tasks (given a fixed architecture), but increasing the number of parameters will eventually cause performance on said tasks to undergo a large increase (akin to something like a phase change in physics).
It sounds like you’re implicitly taking the first hypothesis as a given (e.g. when you assert that there is a “remaining gap vs. fine-tuning that seems [unlikely] to be closed”), but I see no reason to give this hypothesis preferential treatment!
In fact, it seems to be precisely the assertion of the paper’s authors that the first hypothesis should not be taken as a given; and the evidence they give to support this assertion is… the multiple downstream tasks for which an apparent “phase change” did in fact occur. Let’s list them out:
BoolQ (apparent flatline between 2.6B and 13B, then a sudden jump in performance at 175B)
CB (essentially noise between 0.4B and 13B, then a sudden jump in performance at 175B)
RTE (essentially noise until 2.6B, then a sudden shift to very regular improvement until 175B)
WSC (essentially noise until 2.6B, then a sudden shift to very regular improvement until 175B)
basic arithmetic (mostly flat until 6.7B, followed by rapid improvement until 175B)
SquadV2 (apparent flatline at 0.8B, sudden jump at 1.3B followed by approximately constant rate of improvement until 175B)
ANLI round 3 (noise until 13B, sudden jump at 175B)
word-scramble with random insertion (sudden increase in rate of improvement after 6.7B)
Several of the above examples exhibit a substantial amount of noise in their performance graphs, but my point stands nonetheless. Given this, it seems rather odd for you to claim that the “great across-task variance” indicates a lack of general reasoning capability, when that same across-task variance is (if anything) evidence for the opposite: many tasks that previously stumped smaller models are overcome by GPT-3.
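To make the distinction between the two hypotheses a bit more operational, here is a minimal sketch (with made-up accuracy numbers, not figures from the paper) of one way to ask whether a scaling curve looks more like a smooth trend or a flat-then-jump “phase change”: compare a single straight-line fit in log-parameter space against the best flat-then-rising fit with a breakpoint.

```python
# Minimal sketch; the accuracy values below are invented for illustration only.
import numpy as np

log_params = np.log10([0.125e9, 0.35e9, 0.76e9, 1.3e9, 2.7e9, 6.7e9, 13e9, 175e9])
accuracy = np.array([0.52, 0.53, 0.51, 0.52, 0.53, 0.54, 0.55, 0.70])

def sse_linear(x, y):
    """Residual sum of squares of an ordinary least-squares straight line."""
    coeffs = np.polyfit(x, y, deg=1)
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

def sse_hinge(x, y):
    """Best residual sum of squares over candidate breakpoints:
    constant ("flat") before the break, straight line after it."""
    best = np.inf
    for b in x[1:-2]:  # keep at least two points on each side of the break
        left_y = y[x <= b]
        right_x, right_y = x[x > b], y[x > b]
        best = min(best, float(np.sum((left_y - left_y.mean()) ** 2)) + sse_linear(right_x, right_y))
    return best

print("single straight-line fit, SSE:", sse_linear(log_params, accuracy))
print("best flat-then-rising fit, SSE:", sse_hinge(log_params, accuracy))
# A hinge fit that beats the straight line by a wide margin is (weak) evidence for the
# second hypothesis; with this few points and noisy benchmarks, neither fit is decisive.
```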
It’s especially interesting to me that you would write the following, seemingly without realizing the obvious implication (emphasis mine):
we still see a wide spread of task performance despite smooth gains in LM loss, with some of the most distinctive deficits persisting at all scales (common sense physics, cf section 5), and some very basic capabilities only emerging at very large scale and noisily even there (arithmetic)
The takeaway here is, at least in my mind, quite clear: it’s a mistake to evaluate model performance on human terms. Without getting into an extended discussion on whether arithmetic ought to count as a “simple” or “natural” task, empirically transformers do not exhibit a strong affinity for the task. Therefore, the fact that this “basic capability” emerges at all is, or at least should be, strong evidence for generalization capability. As such, the way you use this fact to argue otherwise (both in the section I just quoted and in your original post) seems to me to be exactly backwards.
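As a concrete illustration of the kind of probe at issue, here is a sketch of a two-digit-addition harness: generate questions, then score a model’s completions by exact match against the true sum. The prompt template and the `oracle` stand-in are hypothetical, roughly in the spirit of the paper’s arithmetic tasks rather than its exact wording.

```python
import random

def make_addition_probe(rng):
    """One hypothetical two-digit addition question and its answer string."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    return f"Q: What is {a} plus {b}? A:", str(a + b)

def exact_match_accuracy(generate, n_probes=100, seed=0):
    """Fraction of probes whose completion starts with the correct sum.
    `generate` stands in for whatever prompt -> completion function is available."""
    rng = random.Random(seed)
    probes = [make_addition_probe(rng) for _ in range(n_probes)]
    correct = sum(generate(prompt).strip().startswith(answer) for prompt, answer in probes)
    return correct / n_probes

# Trivial stand-in "model" that parses the question and answers correctly,
# just to show the harness runs end to end (a real LM would go here instead):
def oracle(prompt):
    a, b = (int(tok) for tok in prompt.replace("?", "").split() if tok.isdigit())
    return f" {a + b}"

print(exact_match_accuracy(oracle))  # -> 1.0
```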
Elsewhere, you write:
The ability to get better downstream results is utterly unsurprising: it would be very surprising if language prediction grew steadily toward perfection without a corresponding trend toward good performance on NLP benchmarks
It’s surprising to me that you would write this while also claiming that few-shot prediction seems unlikely to close the gap to fine-tuned models on certain tasks. I can’t think of a coherent model where both of these claims are simultaneously true; if you have one, I’d certainly be interested in hearing what it is.
More generally, this is (again) why I stress the importance of concrete predictions. You call it “utterly unsurprising” that a 175B-param model would outperform smaller ones on NLP benchmarks, and yet neither you nor anyone else could have predicted what the scaling curves for those benchmarks would look like. (Indeed, your entire original post can be read as an expression of surprise at the lack of impressiveness of GPT-3’s performance on certain benchmarks.)
When you only ever look at things in hindsight, without ever setting forth concrete predictions that can be overturned by evidence, you run the risk of never forming a model concrete enough to be engaged with. I don’t believe it’s a coincidence that you called it “difficult” to explain why you found the paper unimpressive: it’s because your standards of impressiveness are opaque enough that they don’t, in and of themselves, constitute a model of how transformers might/might not possess general reasoning ability.
On the reading of the graphs:
All I can say is “I read them differently and I don’t think further discussion of the ‘right’ way to read them would be productive.”
Something that might make my perspective clear:
when I first read this comment, I thought “whoa, that ‘phase change’ point seems fair and important, maybe I just wasn’t looking for that in the graphs”
and then I went back and looked at the graphs and thought “oh, no, that’s obviously not distinguishable from noise; that’s the kind of non-monotonic bouncing around that I expect when you need more data per plotted point to get a reasonable estimate; that Squad V2 graph looks like the other 5 reading comp graphs except with more noise,” etc. etc.
I don’t expect this will convince you I’m right, but the distance here seems more about generic “how to interpret plots in papers” stuff than anything interesting about GPT-3.
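To put a rough number on the “more data per plotted point” concern: when a benchmark’s eval split has only tens to a few hundred examples, the binomial standard error on a single accuracy point is large enough that swings of several points between adjacent model sizes are consistent with pure noise. A minimal sketch, using illustrative round numbers for the eval-set sizes rather than the paper’s exact figures:

```python
import math

def accuracy_std_error(p, n_examples):
    """Standard error of an accuracy estimate p measured on n i.i.d. examples."""
    return math.sqrt(p * (1.0 - p) / n_examples)

for n in (60, 300, 3000):  # illustrative eval-set sizes
    se = accuracy_std_error(0.70, n)
    print(f"n={n:5d}: accuracy 70% +/- {100 * se:.1f} points (1 s.e.), "
          f"+/- {200 * se:.1f} points at ~95% confidence")
```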
On this:
I can’t think of a coherent model where both of these claims are simultaneously true; if you have one, I’d certainly be interested in hearing what it is.
Roughly, my position is that transformer LMs are very impressive and know all sorts of things, even at small scale, although they know them “less noisily” as the scale grows.
The intended connotation of my stance that “fine-tuning will outperform few-shot” is not “haha, transformers are limited, they will never stand on their own without supervised training, boo transformers!” If anything, it’s the opposite:
I think transformers have some limits (e.g. physical / spatial stuff). But, already at the 1.5B scale if not before, they display a very-real-if-noisy understanding of the linguistic phenomena probed by most NLP benchmarks.
I think fine-tuning has shown itself to be a remarkably effective way to “get at” this knowledge for downstream tasks—even with small data sets, not far in scale from the “data sets” used in few-shot.
So, I don’t understand what few-shot gets us in terms of ways to probe transformer understanding (we already had a great one) or as a demo of language understanding (what I see in my own generation experiments, at a scale two orders of magnitude smaller, impresses me far more than the few-shot results).
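For concreteness, here is a minimal sketch of what the “data set” in few-shot prompting amounts to, versus what fine-tuning does with the same examples. The task, demonstrations, and prompt template below are invented for illustration; they are not the paper’s actual formats.

```python
from typing import List, Tuple

def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Concatenate K labeled demonstrations followed by the unlabeled query."""
    demos = "\n\n".join(f"Question: {text}\nAnswer: {label}" for text, label in examples)
    return f"{demos}\n\nQuestion: {query}\nAnswer:"

k_shot_examples = [  # hypothetical labeled pairs
    ("The cat sat on the mat. Is the mat under the cat?", "yes"),
    ("It rained all day. Was the ground dry at night?", "no"),
]
prompt = build_few_shot_prompt(k_shot_examples, "The glass fell off the table. Is the glass on the table?")
print(prompt)
# Fine-tuning would instead take gradient steps on those same K pairs;
# few-shot only conditions the frozen model on them at inference time.
```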
Again, I engage with this stuff foremost as someone who is very impressed by transformer LMs as text generators and has interacted with them a lot in that modality.
So, this all feels a bit like being a dog owner who reads a new paper “demonstrating dogs’ capacity for empathy with humans,” is unimpressed with its methodology, and finds themselves arguing over what concrete model of “dog empathy” they hold and what it predicts for the currently popular “dog empathy” proxy metrics, with a background assumption that they’re some sort of dog-empathy-skeptic.
When in fact—they believe that of course their dog empathizes with them, and they find the methodology of the paper awkwardly under-equipped to explore this complex, and very clearly real, phenomenon.
I’ve already seen GPT-2 display vast declarative knowledge and use words in subtle context-dependent ways, and pick up the many-faceted nuances implied in a prompt, and all those things. When I see it again, but with ~100x the parameters, and in a contrived experimental setting where ~1.5B models technically fare poorly even though I’ve seen them do that kind of thing in real life... should I be impressed?