“If one looks at the performance of particular tasks, such as arithmetic on numbers of a certain size, across model sizes, one often observes points where larger models discontinuously become better at a task.”
Is it accurate to say that one “often observes” this? The only examples I know of are in GPT-3 with the addition, multiplication, and symbolic substitution tasks. I’m not sure how concerned to be about this being a general phenomenon. Does anyone have further examples? Does anyone have insights into whether the GPT-3 examples are special cases or not?
In addition to the original Brown et al. 2020 examples, text style transfer, meta-learning instructability*, RL-finetuning of summarization, self-critique of math word problems, and maybe the improvements in zero-shot translation & program writing/dialogue (I’d have to double-check those) have been shown with GPT-3 and LaMDA to ‘kick in’ at certain sizes going from the O(1b) models up to the 10–1000b range. Nobody seems very surprised these days to see something work on GPT-3-175b but not on ~1b.
* Should we count all of the examples of meta-learning / generalization which require diverse environments to get abruptly better performance out of sample, like XLand or the MuZero meta-learning paper I mention over in EfficientZero? That’s definitely a stark jump in performance: the single-environment agents, no matter how good in the primary environment, typically perform extremely poorly or even near floor in the new environment.
Thanks. I was looking for more graphs with discontinuous jumps and “# of parameters” on the x-axis… but I think “totally new and unexpected capabilities after going from GPT-2 to GPT-3” is also a reasonable thing to point at. The scaling laws bibliography is super, super useful. I am just embarking on making my way through it now.
You can usually dig those ‘money shot’ capability jump graphs out of the papers, I think. I try to add them to my annotations when I write them, because that’s a critical stylized fact about DL’s blessings of scale. I’m not going to look now, but Brown has the graphs, and I’m pretty sure the text style transfer & RL finetuning papers have the money shot graphs too, and probably the others. XLand and MuZero might have them if you squint (not necessarily with parameter count on the x-axis; parameters aren’t the only thing that scales, remember!).
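For concreteness, the kind of “accuracy vs. # of parameters” graph in question looks roughly like the sketch below. The numbers are entirely made up (they are not from any paper); the point is only the qualitative shape, flat-then-sudden-jump on a log-parameter axis.

```python
# Illustrative only: hypothetical accuracies, not data from any paper.
import matplotlib.pyplot as plt

params = [1e8, 3e8, 1e9, 3e9, 1e10, 3e10, 1e11, 2e11]          # model sizes
accuracy = [0.01, 0.01, 0.02, 0.03, 0.05, 0.20, 0.55, 0.80]    # made-up few-shot accuracy

plt.figure(figsize=(5, 3))
plt.plot(params, accuracy, marker="o")
plt.xscale("log")
plt.xlabel("# of parameters")
plt.ylabel("task accuracy (few-shot)")
plt.title("Hypothetical capability jump with scale")
plt.tight_layout()
plt.show()
```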
Great.
Also, I just realized that the “grokking” phenomenon is relevant here. The “grokking” paper shows jumps during training rather than across model sizes, but it’s a similar discontinuity. Through the lens of the lottery ticket hypothesis, it’s not surprising that grokking might be easier / more likely in larger models.
I wonder how much “grokking” is specific to Transformers. I happened to stumble across an example in the literature where a CNN model “fails to grok” the Game of Life: https://arxiv.org/abs/2009.01398. I wonder what would happen if you used a Transformer model instead.
Also, please check out my comment on your Scaling Laws bibliography page when you get a chance.
I hesitate to call grokking an example of the blessings of scale because it’s still not clear what is going on with either grokking or patient teacher. They are, after all, tiny models, and patient teacher is all about distilling to small models. And the need for regularization is strange if it’s a scaling thing where larger = better: what, the implicit regularization from being tiny isn’t enough, so it needs additional regularization from weight decay?
I doubt grokking is unique to Transformers. The research I see as most closely related to grokking, the shallow-minima paradigm with its wide basins & cyclic learning rates, is well-established for CNNs. Not finding it for one particular CNN is pretty weak evidence, given that the grokking paper shows you can go anywhere from roughly 0% to something like 90% depending on the details of the setup and how long you run.
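Since the grokking setup keeps coming up, here is a stripped-down sketch of that kind of experiment (modular arithmetic, heavy weight decay, training far past the point where train accuracy saturates). I’ve swapped the paper’s small Transformer for a tiny embedding+MLP just to keep it short, so whether this exact toy groks will depend on the hyperparameters; the point is how small and detail-sensitive the whole setup is.

```python
# Stripped-down grokking-style run: (a + b) mod p with a tiny network, strong
# weight decay, and far more steps than needed to memorize the training split.
# The interesting observation in the paper is val accuracy jumping long after
# train accuracy hits ~100%; whether this toy reproduces that is hyperparameter-dependent.
import torch
import torch.nn as nn

p = 97
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))  # all (a, b) pairs
labels = (pairs[:, 0] + pairs[:, 1]) % p

# 50/50 train/val split over all p*p equations
perm = torch.randperm(len(pairs))
train_idx, val_idx = perm[: len(perm) // 2], perm[len(perm) // 2:]

class TinyNet(nn.Module):
    def __init__(self, p, d=128):
        super().__init__()
        self.embed = nn.Embedding(p, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(), nn.Linear(256, p))

    def forward(self, x):               # x: (batch, 2) integer operands
        return self.mlp(self.embed(x).flatten(1))

model = TinyNet(p)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(100_000):             # deliberately "overtrain", full batch
    model.train()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        model.eval()
        with torch.no_grad():
            tr = (model(pairs[train_idx]).argmax(-1) == labels[train_idx]).float().mean().item()
            va = (model(pairs[val_idx]).argmax(-1) == labels[val_idx]).float().mean().item()
        print(f"step {step:6d}  train acc {tr:.2f}  val acc {va:.2f}")
```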
In the AlphaZero interpretability paper [1], CTRL+F “Ruy Lopez” for an example where the model’s progress was much faster than human progress in quality.
[1] https://arxiv.org/pdf/2111.09259.pdf
That’s within-training by epoch/iteration, not across trained models by total size/compute. It’s not clear that they are at all the same sort of thing, because you can get spikes trivially by things like the learning rate dropping. Investigating whether there is any connection would be interesting.
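One low-effort way to rule out the boring explanation (a within-training “jump” that merely coincides with a scheduled learning rate drop) is to plot the schedule alongside the metric curve. A minimal sketch with a hypothetical StepLR schedule, not taken from any particular paper:

```python
# Sanity check for within-training jumps: overlay the LR schedule on the
# metric curve and see whether the jump lines up with an LR drop.
# Hypothetical setup: SGD with a StepLR schedule dropping 10x every 30 epochs.
import torch
import matplotlib.pyplot as plt

model = torch.nn.Linear(10, 1)                       # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)

lrs = []
for epoch in range(100):
    # ... one epoch of actual training would go here ...
    lrs.append(opt.param_groups[0]["lr"])            # record the LR in effect
    sched.step()

plt.plot(lrs)
plt.yscale("log")
plt.xlabel("epoch")
plt.ylabel("learning rate")
plt.title("Schedule to overlay against the metric curve")
plt.show()
```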