It’s interesting that language models have, for the moment at least, stopped scaling (outside of MoE models). Nearly two years after its release, nothing more than an order of magnitude larger than GPT-3 has been unveiled, AFAIK.
Compute is much more important than mere parameter count* (as MoEs demonstrate and Chinchilla rubs your nose in). Investigating post-GPT-3 compute:

https://www.lesswrong.com/posts/sDiGGhpw7Evw7zdR4/compute-trends-comparison-to-openai-s-ai-and-compute
https://www.lesswrong.com/posts/XKtybmbjhC6mXDm5z/compute-trends-across-three-eras-of-machine-learning

Between Megatron-Turing NLG, Yuan, Jurassic, and Gopher (and an array of smaller ~GPT-3-scale efforts), we look like we’re still on the old scaling trend, just not the hyper-fast scaling trend you could get by cherry-picking a few recent points.
* Parameter-count was a useful proxy back when everyone was doing compute-optimal scaling on dense models and training a 173b beat a 17b beat a 1.7b, but then everyone started dabbling in cheaper, undertrained models (undertrained even by the then-known scaling laws), and some entities looked like they were optimizing for headlines rather than capabilities. So these days it’s better to emphasize compute. There’s no easy way to cheat petaflop/s-days… yet.
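For concreteness, here is a sketch of the usual back-of-the-envelope for the compute metric, assuming the standard C ≈ 6·N·D approximation for dense transformers; the GPT-3 inputs are the published 175B parameters and ~300B training tokens, and the helper names are just illustrative:

```python
# Rough training-compute estimate via the standard C ≈ 6 * params * tokens rule of thumb.

def training_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6 * params * tokens

def petaflop_s_days(flops: float) -> float:
    """Convert raw FLOPs to the 'petaflop/s-days' unit used in the GPT-3 paper."""
    return flops / (1e15 * 86400)

gpt3_flops = training_flops(175e9, 300e9)
print(f"GPT-3: {gpt3_flops:.2e} FLOPs ≈ {petaflop_s_days(gpt3_flops):,.0f} petaflop/s-days")
# -> ~3.15e23 FLOPs ≈ ~3,600 petaflop/s-days, close to the paper's reported ~3,640.
```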
It’s roughly an order of magnitude more compute than GPT-3.
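Applying the same 6·N·D rule of thumb to PaLM's reported size and token count (540B parameters, ~780B tokens) gives roughly 8x GPT-3's compute, i.e. about an order of magnitude; this is a sketch, not the paper's exact accounting:

```python
# Same 6 * params * tokens approximation, applied to PaLM and GPT-3.
palm_flops = 6 * 540e9 * 780e9   # ~2.5e24 FLOPs
gpt3_flops = 6 * 175e9 * 300e9   # ~3.15e23 FLOPs
print(f"PaLM / GPT-3 compute ratio ≈ {palm_flops / gpt3_flops:.1f}x")
# -> ~8x, i.e. roughly an order of magnitude more compute than GPT-3.
```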
Which is reasonable. It has been a bit under 2.5 years since GPT-3 was trained (they mention the move to Azure disrupting training, IIRC, which lets you date it earlier than just ‘May 2020’). Under the 3.4-month “AI and Compute” doubling trend, you’d expect ~8.8 doublings, i.e. the top run now being ~445x GPT-3. I do not think anyone has a 445x run they are about to unveil any second now. Whereas on the slower >5.7-month doubling in that link, you would expect <36x, which is still ~3x PaLM’s actual ~10x, but at least the right order of magnitude.
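The arithmetic behind those factors, as a quick sketch (taking ~30 months since GPT-3 was trained; exact figures shift a bit with the month count you assume):

```python
# Back-of-the-envelope for the expected scale-up factor under two doubling-time assumptions.
months_since_gpt3 = 30  # "a bit under 2.5 years"

for label, doubling_months in [("AI and Compute (3.4-month doubling)", 3.4),
                               ("slower trend (5.7-month doubling)", 5.7)]:
    doublings = months_since_gpt3 / doubling_months
    factor = 2 ** doublings
    print(f"{label}: {doublings:.1f} doublings -> {factor:,.0f}x the GPT-3 run")
# -> roughly 450x on the fast trend vs ~40x on the slow one; a slightly shorter elapsed
#    time moves these toward the ~445x and <36x figures quoted above.
```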
There may also be other runs around PaLM scale, pushing the peak closer to 30x. (E.g. Gopher was kept secret for a long time, and a larger Chinchilla would be a logical thing to do that we might not hear about until next year; and no one has actually computed the total FLOPs for ERNIE-Titan AFAIK, and it may still be running, so who knows what its total compute consumption is up to.) So PaLM’s ~10x is a lower bound, and 5 years from now we may look back and say “ah yes, XYZ nailed the compute trend exactly, we just didn’t learn about it until recently when they happened to disclose exact numbers.” Somewhat like how some StarCraft predictions were falsified but retroactively turned out to be right, because we just didn’t know about AlphaStar and no one had noticed Vinyals’s Blizzard talk implying they were positioned for it.
Thanks for the explanation Gwern. Goodhart’s law strikes again!
Maybe https://en.wikipedia.org/wiki/Minifloat is a way to cheat the FLOP metric?
That’s already what TPUs do, basically
I think that higher precision isn’t always needed (or used efficiently).
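As an illustration of how much precision and range a small float format gives up, here is a toy sketch using numpy's float16 as a stand-in minifloat (TPU bfloat16 is in the same spirit, trading an even smaller mantissa for float32's exponent range); the example values are mine:

```python
import numpy as np

# float16 as a stand-in "minifloat": 1 sign + 5 exponent + 10 mantissa bits.
x = np.float32(3.141592653589793)
print(np.float16(x))        # 3.14 -- only ~3 decimal digits of mantissa survive
print(np.float16(70000.0))  # inf  -- max finite float16 is 65504, so this overflows
print(np.float16(1e-8))     # 0.0  -- underflows below the smallest subnormal (~6e-8)
```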
540 billion parameters is about 3 times more than GPT-3’s 175 billion, which is consistent with a Moore’s Law doubling time of about 18 months. I don’t see how this is evidence for language model scaling slowing down.
As Adam said, merely tracking Moore’s Law is far slower than the previous trajectory of model scaling. In 2020, after the release of GPT-3, there was widespread speculation that trillion-parameter models would begin to emerge by the next year.
Language model parameter counts were growing much faster than 2x/18mo for a while.
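A rough comparison of implied doubling times, using the commonly cited sizes and dates (GPT-2 1.5B in Feb 2019, GPT-3 175B in May 2020, PaLM 540B in Apr 2022); this is a back-of-the-envelope, not a fitted trend:

```python
import math

# Implied parameter-count doubling time between two models, vs Moore's Law's ~18 months.
def doubling_months(params_a: float, params_b: float, months_apart: float) -> float:
    return months_apart / math.log2(params_b / params_a)

# GPT-2 (1.5B, Feb 2019) -> GPT-3 (175B, May 2020): ~15 months apart.
print(f"GPT-2 -> GPT-3: doubling every {doubling_months(1.5e9, 175e9, 15):.1f} months")
# GPT-3 (175B, May 2020) -> PaLM (540B, Apr 2022): ~23 months apart.
print(f"GPT-3 -> PaLM:  doubling every {doubling_months(175e9, 540e9, 23):.1f} months")
# -> roughly 2 months per doubling in 2019-2020, vs ~14 months for GPT-3 -> PaLM,
#    i.e. the earlier growth was far faster than an 18-month Moore's-Law-like pace.
```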