If one ignores the “GPT-3” terminology, then yeah, it’s a perfectly decent scaling-up-transformers paper similar to the others that have come out in the last few years. (A paper with some flaws, but that’s not surprising.)
But, I would be very surprised if there isn’t a lot of hype about this paper—hype largely due to the “GPT-3” term, and the inappropriate expectations it sets. People are naturally going to think “GPT-3” is as much of a step forward as “GPT-2” was, and it isn’t. I take a critical tone here in an effort to cut that hype off at the pass.
Maybe this is just my AI safety focus, or something, but I find myself annoyed by ‘hype management’ more often than not. I think the root cause of the frustration is that it’s easier to reach agreement on object-level details than on interpretations, which are themselves easier to agree on than interpretations of interpretations.
Like, when I heard “GPT-3”, I thought “like GPT-2, except one more,” and from what I can tell that expectation is roughly accurate. The post agrees, and notes that since “one” doesn’t correspond to anything here, the main thing the name tells you is that this transformer paper came from the people who feel they own the GPT name rather than from people who don’t. It sounds like you expected “GPT” to mean something more like “paradigm-breaker” and so you were disappointed, but this feels like a ding on your expectations more than a ding on the paper.
But under the hype-management goal, the question of whether we should celebrate it as “as predicted, larger models continue to perform better, and astoundingly, 175B parameters still hasn’t converged with the amount of training we did” or criticize it as “oh, it is merely the confirmation of a widely suspected prediction” isn’t a question of what’s in the paper (neither framing disputes that), or even of your personal take, but of what you expect the social distribution of takes to be, so that your statement pulls the group beliefs in the right direction.
---
Maybe putting this another way: when I view this as “nostalgebraist the NLP expert who is following and sharing his own research taste”, I like the post, since expert taste is useful even if you, the reader, disagree; and when I view it as “nostalgebraist the person who has goals for social epistemology around NLP”, I like it less.
I agree with you about hype management in general, I think. The following does seem like a point of concrete disagreement:
It sounds like you expected “GPT” to mean something more like “paradigm-breaker” and so you were disappointed, but this feels like a ding on your expectations more than a ding on the paper.
If the paper had not done few-shot learning, and had just reviewed LM task performance / generation quality / zero-shot (note that zero-shot scales up well too!), I would agree with you.
However, as I read the paper, it touts few-shot as this new, exciting capability that only properly emerges at the new scale. I expected that, if any given person found the paper impressive, it would be for this purported newness and not only “LM scaling continues,” and this does seem to be the case (e.g. gwern, dxu). So there is a real, object-level dispute over the extent to which this is a qualitative jump.
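(To make the object-level claim concrete for readers who haven’t looked at the paper: “few-shot” here means conditioning the model on a handful of demonstrations in the prompt, with no fine-tuning or gradient updates. A minimal illustrative sketch, where `complete` is a hypothetical stand-in for whatever actually runs the language model:)

```python
# Sketch of few-shot prompting in the GPT-3 sense: the "learning" is just
# conditioning on demonstrations in the prompt; no weights are updated.
def build_few_shot_prompt(demonstrations, query):
    """Concatenate (input, output) demonstration pairs, then append the new query."""
    blocks = [f"Q: {x}\nA: {y}" for x, y in demonstrations]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)

prompt = build_few_shot_prompt(
    demonstrations=[("2 + 2", "4"), ("3 + 5", "8")],
    query="7 + 6",
)
# completion = complete(prompt)  # hypothetical model call; the paper's claim
#                                # is that this works much better at 175B scale
```

The dispute is over whether the fact that this kind of prompting works noticeably better at 175B constitutes a qualitative jump, or just a continuation of the scaling trend.)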
I’m not sure I have concrete social epistemology goals except “fewer false beliefs” -- that is, I am concerned with group beliefs, but only because they point to which truths will be most impactful to voice. I predicted people would be overly impressed with few-shot, and I wanted to counter that. Arguably I should have concentrated less on “does this deserve the title GPT-3?” and more heavily on few-shot, as I’ve done more recently.
I think if I were you, then, I would have focused more on how we already knew you could scale up transformers and get more and more impressive results. I had heard of (and maybe skimmed) some of those other papers, so I was already somewhat confident that you could scale up transformers and get more impressive results… but I didn’t quite believe it, deep down. Deep down I thought that probably there was going to be some catch or limitation I didn’t know of yet that would prevent this easy scaling from going on much farther, or leading to anything interestingly new. After all, speculation is easy; making predictions and then later confirming them is hard. Well, now it’s confirmed. This doesn’t change my credences that much (maybe they go from 60% to 90% for “can we scale up language models” and from like 20% to 30% for “are we within 5 years of some sort of transformative AI”), but it’s changed my gut.