Thanks… I was looking for more graphs with discontinuous jumps and “# of parameters” on the x-axis, but I think “totally new and unexpected capabilities after going from GPT-2 to GPT-3” is a reasonable thing to point at, also. The scaling laws bibliography is super, super useful; I am just starting to make my way through it now.
You can usually dig those ‘money shot’ capability-jump graphs out of the papers, I think. I try to add them to my annotations when I make them, because that’s a very critical stylized fact about DL’s blessings of scale. I’m not going to look now, but Brown et al. has the graphs, and I’m pretty sure the text style transfer & RL finetuning papers have the money-shot graphs too, and probably the others. XLand and MuZero might have them if you squint (not necessarily in parameter count; parameters aren’t the only thing that scales, remember!).
Great..
Also, I just realized that the “grokking” phenomenon is relevant here. The “grokking” paper shows jumps during training rather than across parameter counts, but it’s a similar kind of discontinuity. Through the lens of the lottery ticket hypothesis, it wouldn’t be surprising if grokking were easier / more likely in larger models.
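To make the setup concrete: the grokking experiments are roughly “small Transformer, small algorithmic dataset such as modular arithmetic, heavy weight decay, train far past the point where the training set is memorized.” Here is a minimal sketch in PyTorch; the architecture and hyperparameters are illustrative stand-ins, not the paper’s exact ones:

```python
# Grokking-style setup (sketch): modular addition with a tiny Transformer,
# heavy weight decay, and training far past memorization of the train set.
# Architecture and hyperparameters are illustrative, not the paper's.
import torch
import torch.nn as nn

p = 97                                   # modulus; the task is (a + b) mod p
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
split = len(pairs) // 2                  # half of all pairs for training
train_x, train_y = pairs[perm[:split]], labels[perm[:split]]
val_x, val_y = pairs[perm[split:]], labels[perm[split:]]

class TinyTransformer(nn.Module):
    def __init__(self, p, d=128):
        super().__init__()
        self.embed = nn.Embedding(p, d)
        self.pos = nn.Parameter(torch.zeros(2, d))   # learned positions for the 2 tokens
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4,
                                           dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, p)

    def forward(self, x):                # x: (batch, 2) token ids
        h = self.embed(x) + self.pos
        h = self.encoder(h)
        return self.head(h.mean(dim=1))  # pool the two tokens, predict the sum mod p

model = TinyTransformer(p)
# Weight decay is the regularizer the grokking paper leans on; without it the
# delayed generalization reportedly tends not to show up.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(100_000):              # the point is to keep training long after
    opt.zero_grad()                      # train accuracy has saturated
    loss = loss_fn(model(train_x), train_y)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            val_acc = (model(val_x).argmax(-1) == val_y).float().mean()
        print(f"step {step}: train loss {loss.item():.3f}, val acc {val_acc:.3f}")
```

The grokking signature would be train accuracy hitting ~100% early while validation accuracy sits near chance for a long stretch and then jumps.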
I wonder how much “grokking” is specific to Transformers. I happened to stumble across an example in the literature where a CNN model “fails to grok” the Game of Life: https://arxiv.org/abs/2009.01398. I wonder what would happen if you used a Transformer model instead.
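Part of what makes that Game of Life result striking is that the target function is trivially representable by a tiny CNN; the hard part is getting gradient descent to find it. A quick sketch of the one-step update as a convolution (illustrative code, not taken from the paper):

```python
# The Game of Life step fits in a couple of lines of convolution, which is what
# makes "fails to grok it" interesting: the function is easy to represent with a
# small CNN, yet SGD often doesn't find it. Sketch using scipy for neighbor counts.
import numpy as np
from scipy.signal import convolve2d

KERNEL = np.array([[1, 1, 1],
                   [1, 0, 1],
                   [1, 1, 1]])           # counts the 8 neighbors of each cell

def life_step(board: np.ndarray) -> np.ndarray:
    """One Game of Life update on a 0/1 grid (dead cells beyond the border)."""
    neighbors = convolve2d(board, KERNEL, mode="same", boundary="fill")
    born = (board == 0) & (neighbors == 3)
    survive = (board == 1) & ((neighbors == 2) | (neighbors == 3))
    return (born | survive).astype(board.dtype)

# A glider, to sanity-check the rule:
board = np.zeros((8, 8), dtype=np.int64)
board[1, 2] = board[2, 3] = board[3, 1] = board[3, 2] = board[3, 3] = 1
print(life_step(board))
```

Training on (board, life_step(board)) pairs sampled from random boards is roughly the kind of supervised task in question, and swapping the CNN for a small Transformer over the flattened grid would be the experiment suggested above.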
Also, please check out my comment on your Scaling Laws bibliography page when you get a chance.
I hesitate to call grokking an example of the blessings of scale, because it’s still not clear what is going on with either grokking or patient-teacher distillation. The grokking models are, after all, tiny, and patient-teacher is all about distilling into small models. And the need for regularization is strange if it’s a scaling thing where larger = better: what, the implicit regularization from tininess isn’t enough, it needs additional regularization from weight decay?
I doubt grokking is unique to Transformers. The research I see as most related to grokking, the shallow-minima-finding paradigm with its wide basins & cyclic learning rates, is well-established for CNNs. Not finding it for some particular CNN is pretty weak evidence, given that the grokking paper shows you can go anywhere from ~0% to something like 90% accuracy depending on the details of the setup and how long you train.
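For reference, the cyclic learning rate piece of that line of work is already packaged in standard tooling; a throwaway sketch with PyTorch’s built-in scheduler, using a placeholder model and random stand-in data:

```python
# Cyclic learning rates of the kind used in the wide-basin/flat-minima literature
# are a one-liner in PyTorch; the model and data below are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
sched = torch.optim.lr_scheduler.CyclicLR(opt, base_lr=1e-4, max_lr=0.1,
                                          step_size_up=2000, mode="triangular")

for step in range(10_000):
    x = torch.randn(32, 3, 32, 32)       # stand-in batch; real data would go here
    y = torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()                          # CyclicLR is stepped once per batch
```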