It’s interesting that language models have, for the moment at least, stopped scaling (outside of MoE models). Nearly two years after its release, nothing more than an order of magnitude larger than GPT-3 has been unveiled, AFAIK.
Compute is much more important than mere parameter count* (as MoEs demonstrate and Chinchilla rubs your nose in). Investigating post-GPT-3 compute:

https://www.lesswrong.com/posts/sDiGGhpw7Evw7zdR4/compute-trends-comparison-to-openai-s-ai-and-compute
https://www.lesswrong.com/posts/XKtybmbjhC6mXDm5z/compute-trends-across-three-eras-of-machine-learning

Between Megatron-Turing NLG, Yuan, Jurassic, and Gopher (and an array of smaller ~GPT-3-scale efforts), we look like we’re still on the old scaling trend, just not the hyper-fast scaling trend you could get by cherry-picking a few recent points.
* Parameter-count was a useful proxy back when everyone was doing compute-optimal scaling on dense models and training a 173b beat a 17b beat a 1.7b, but then everyone started dabbling in cheaper, undertrained models (undertrained even by the then-known scaling laws), and some entities looked like they were optimizing for headlines rather than capabilities. So these days it’s better to emphasize compute. There’s no easy way to cheat petaflop/s-days… yet.
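For concreteness, here is a sketch of the usual back-of-the-envelope for the compute metric, assuming the standard C ≈ 6·N·D approximation for dense transformers; the GPT-3 inputs are the published 175B parameters and ~300B training tokens, and the helper names are just illustrative:

```python
# Rough training-compute estimate via the standard C ≈ 6 * params * tokens rule of thumb.

def training_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6 * params * tokens

def petaflop_s_days(flops: float) -> float:
    """Convert raw FLOPs to the 'petaflop/s-days' unit used in the GPT-3 paper."""
    return flops / (1e15 * 86400)

gpt3_flops = training_flops(175e9, 300e9)
print(f"GPT-3: {gpt3_flops:.2e} FLOPs ≈ {petaflop_s_days(gpt3_flops):,.0f} petaflop/s-days")
# -> ~3.15e23 FLOPs ≈ ~3,600 petaflop/s-days, close to the paper's reported ~3,640.
```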
It’s roughly an order of magnitude more compute than GPT-3.
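Applying the same 6·N·D rule of thumb to PaLM's reported size and token count (540B parameters, ~780B tokens) gives roughly 8x GPT-3's compute, i.e. about an order of magnitude; this is a sketch, not the paper's exact accounting:

```python
# Same 6 * params * tokens approximation, applied to PaLM and GPT-3.
palm_flops = 6 * 540e9 * 780e9   # ~2.5e24 FLOPs
gpt3_flops = 6 * 175e9 * 300e9   # ~3.15e23 FLOPs
print(f"PaLM / GPT-3 compute ratio ≈ {palm_flops / gpt3_flops:.1f}x")
# -> ~8x, i.e. roughly an order of magnitude more compute than GPT-3.
```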
Which is reasonable. It has been a bit under 2.5 years since GPT-3 was trained (they mention the move to Azure disrupting training, IIRC, which lets you date it earlier than just ‘May 2020’). Under the 3.4-month “AI and Compute” doubling trend, you’d expect ~8.8 doublings, i.e. the top run now being ~445x GPT-3. I do not think anyone has a 445x run they are about to unveil any second now. Whereas on the slower >5.7-month doubling in that link, you would expect <36x, which is still ~3x PaLM’s actual ~10x, but at least the right order of magnitude.
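The arithmetic behind those factors, as a quick sketch (taking ~30 months since GPT-3 was trained; exact figures shift a bit with the month count you assume):

```python
# Back-of-the-envelope for the expected scale-up factor under two doubling-time assumptions.
months_since_gpt3 = 30  # "a bit under 2.5 years"

for label, doubling_months in [("AI and Compute (3.4-month doubling)", 3.4),
                               ("slower trend (5.7-month doubling)", 5.7)]:
    doublings = months_since_gpt3 / doubling_months
    factor = 2 ** doublings
    print(f"{label}: {doublings:.1f} doublings -> {factor:,.0f}x the GPT-3 run")
# -> roughly 450x on the fast trend vs ~40x on the slow one; a slightly shorter elapsed
#    time moves these toward the ~445x and <36x figures quoted above.
```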
There may also be other runs around PaLM scale, pushing the peak closer to 30x. (E.g. Gopher was kept secret for a long time, and a larger Chinchilla would be a logical thing to do that we might not hear about until next year; and no one has actually computed the total FLOPs for ERNIE-Titan AFAIK, and it may still be running, so who knows what its total compute consumption is up to.) So PaLM’s ~10x is a lower bound, and 5 years from now we may look back and say “ah yes, XYZ nailed the compute trend exactly, we just didn’t learn about it until recently when they happened to disclose exact numbers.” Somewhat like how some StarCraft predictions were falsified but retroactively turned out to be right, because we just didn’t know about AlphaStar and no one had noticed Vinyals’s Blizzard talk implying they were positioned for it.
Thanks for the explanation Gwern. Goodhart’s law strikes again!
Maybe https://en.wikipedia.org/wiki/Minifloat is a way to cheat the FLOP metric?
That’s already what TPUs do, basically
I think that higher precision isn’t always needed (or used efficiently).
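As an illustration of how much precision and range a small float format gives up, here is a toy sketch using numpy's float16 as a stand-in minifloat (TPU bfloat16 is in the same spirit, trading an even smaller mantissa for float32's exponent range); the example values are mine:

```python
import numpy as np

# float16 as a stand-in "minifloat": 1 sign + 5 exponent + 10 mantissa bits.
x = np.float32(3.141592653589793)
print(np.float16(x))        # 3.14 -- only ~3 decimal digits of mantissa survive
print(np.float16(70000.0))  # inf  -- max finite float16 is 65504, so this overflows
print(np.float16(1e-8))     # 0.0  -- underflows below the smallest subnormal (~6e-8)
```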
540 billion parameters is about 3 times more than GPT-3’s 175 billion, which is consistent with a Moore’s Law doubling time of about 18 months. I don’t see how this is evidence for language model scaling slowing down.
As Adam said, merely tracking Moore’s Law is far slower than the previous trajectory of model scaling. In 2020, after the release of GPT-3, there was widespread speculation that trillion-parameter models would begin to emerge by the next year.
Language model parameter counts were growing much faster than 2x/18mo for a while.
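A rough comparison of implied doubling times, using the commonly cited sizes and dates (GPT-2 1.5B in Feb 2019, GPT-3 175B in May 2020, PaLM 540B in Apr 2022); this is a back-of-the-envelope, not a fitted trend:

```python
import math

# Implied parameter-count doubling time between two models, vs Moore's Law's ~18 months.
def doubling_months(params_a: float, params_b: float, months_apart: float) -> float:
    return months_apart / math.log2(params_b / params_a)

# GPT-2 (1.5B, Feb 2019) -> GPT-3 (175B, May 2020): ~15 months apart.
print(f"GPT-2 -> GPT-3: doubling every {doubling_months(1.5e9, 175e9, 15):.1f} months")
# GPT-3 (175B, May 2020) -> PaLM (540B, Apr 2022): ~23 months apart.
print(f"GPT-3 -> PaLM:  doubling every {doubling_months(175e9, 540e9, 23):.1f} months")
# -> roughly 2 months per doubling in 2019-2020, vs ~14 months for GPT-3 -> PaLM,
#    i.e. the earlier growth was far faster than an 18-month Moore's-Law-like pace.
```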