On the reading of the graphs:
All I can say is “I read them differently and I don’t think further discussion of the ‘right’ way to read them would be productive.”
Something that might make my perspective clear:
when I first read this comment, I thought “whoa, that ‘phase change’ point seems fair and important, maybe I just wasn’t looking for that in the graphs”
and then I went back and looked at the graphs and thought “oh, no, that’s obviously not distinguishable from noise; that’s the kind of non-monotonic bouncing around that I expect when you need more data per plotted point to get a reasonable estimate; that SQuAD v2 graph looks like the other 5 reading-comprehension graphs except with more noise,” etc. etc.
I don’t expect this will convince you I’m right, but the distance here seems more about generic “how to interpret plots in papers” stuff than anything interesting about GPT-3.
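To illustrate what I mean by needing more data per plotted point, here is a toy simulation. It is mine, not from the paper, and every number in it is invented: a “true” accuracy that improves smoothly and monotonically with scale still produces non-monotonic bouncing when each plotted point is estimated from a small eval set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model sizes (billions of params) and a smooth "true" accuracy
# trend; both are invented for illustration, not taken from any paper.
model_sizes = [0.1, 0.4, 0.8, 1.5, 2.7, 6.7, 13, 175]
true_acc = np.linspace(0.55, 0.75, len(model_sizes))

for n_eval in (100, 10_000):  # small vs. large eval set per plotted point
    # Each plotted point is an accuracy estimated from n_eval binary outcomes.
    observed = rng.binomial(n_eval, true_acc) / n_eval
    drops = int((np.diff(observed) < 0).sum())
    print(f"n_eval={n_eval:>6}  observed={np.round(observed, 3)}  "
          f"drops={drops} of {len(model_sizes) - 1}")
```

With 100 examples per point, the standard error is about 0.05, which is larger than the ~0.03 true improvement between adjacent sizes in this toy setup, so spurious drops are the expected outcome; with 10,000 examples the estimates track the true curve closely.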
On this:
“I can’t think of a coherent model where both of these claims are simultaneously true; if you have one, I’d certainly be interested in hearing what it is.”
Roughly, my position is that transformer LMs are very impressive and know all sorts of things, even at small scale, although they know them “less noisily” as the scale grows.
The intended connotation of my stance that “fine-tuning will outperform few-shot” is not “haha, transformers are limited, they will never stand on their own without supervised training, boo transformers!” If anything, it’s the opposite:
I think transformers have some limits (e.g. physical/spatial reasoning). But, already at the 1.5B scale if not before, they display a very-real-if-noisy understanding of the linguistic phenomena probed by most NLP benchmarks.
I think fine-tuning has shown itself to be a remarkably effective way to “get at” this knowledge for downstream tasks—even with small data sets, not far in scale from the “data sets” used in few-shot.
So, I don’t understand what few-shot gets us in terms of ways to probe transformer understanding (we already had a great one) or as a demo of language understanding (what I see in my own generation experiments, at a scale two orders of magnitude smaller, impresses me far more than the few-shot results).
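To spell out the contrast I’m drawing between the two ways of spending a handful of labeled examples, here is a rough sketch using GPT-2 through the HuggingFace transformers library. The task, prompt format, and hyperparameters are all invented for illustration; the point is only the shape of the two approaches: few-shot puts the examples in the prompt and leaves the weights frozen, while fine-tuning uses the same examples as gradient updates.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A tiny labeled set, on the order of what a few-shot prompt contains.
examples = [
    ("The movie was dull and lifeless.", "negative"),
    ("An absolute delight from start to finish.", "positive"),
]

# Few-shot: the labeled examples live in the prompt; the weights stay frozen.
prompt = "".join(f"Review: {x}\nSentiment: {y}\n\n" for x, y in examples)
prompt += "Review: I loved every minute of it.\nSentiment:"
inputs = tokenizer(prompt, return_tensors="pt")
generated = model.generate(
    **inputs, max_new_tokens=2, do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[1]:]))

# Fine-tuning: the same handful of examples, used as gradient updates instead.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for _ in range(3):  # a few passes over the tiny set
    for x, y in examples:
        batch = tokenizer(f"Review: {x}\nSentiment: {y}", return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Both branches consume the same amount of labeled data; fine-tuning is simply allowed to move the weights.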
Again, I engage with this stuff foremost as someone who is very impressed by transformer LMs as text generators and has interacted with them a lot in that modality.
So, this all feels a bit like being a dog owner who reads a new paper “demonstrating dogs’ capacity for empathy with humans,” is unimpressed with its methodology, and finds themselves arguing over what concrete model of “dog empathy” they hold and what it predicts for the currently popular “dog empathy” proxy metrics, with a background assumption that they’re some sort of dog-empathy-skeptic.
When in fact—they believe that of course their dog empathizes with them, and they find the methodology of the paper awkwardly under-equipped to explore this complex, and very clearly real, phenomenon.
I’ve already seen GPT-2 display vast declarative knowledge and use words in subtle context-dependent ways, and pick up the many-faceted nuances implied in a prompt, and all those things. When I see it again, but with ~100x the parameters, and in a contrived experimental setting where ~1.5B models technically fare poorly even if I’ve seen them do that kind of thing in real life… should I be impressed?