Unlikely, because Gopher is so far from what they find optimal. See the table of requirements, which helpfully defines compute requirements in terms of “Gophers” (perhaps they were thinking much the same thing). An optimal 280b-parameter model (ie. a Gopher) requires 17.2 Gophers’ worth of compute; to put it another way, Gopher used only ~6% of the compute it would have needed to be an optimal model. For what it would take to ‘finish’ Gopher, you could instead train almost 3 different optimal 175-billion-parameter models from scratch (each costs 6.7× a Gopher).
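A quick sanity-check of those ratios, assuming the standard C ≈ 6·N·D approximation for training FLOPs and Chinchilla’s ~20-tokens-per-parameter rule of thumb (the exact 17.2×/6.7× figures come from the paper’s own table, so the rule of thumb only roughly reproduces them):

```python
# Back-of-the-envelope check of the "Gophers" arithmetic.
# C ~ 6 * N * D (N = parameters, D = training tokens), and compute-optimal
# training takes roughly 20 tokens per parameter per Chinchilla.

def train_flops(params, tokens):
    """Approximate training compute: C ~ 6 * N * D."""
    return 6 * params * tokens

gopher = train_flops(280e9, 300e9)             # Gopher as actually trained
optimal_280b = train_flops(280e9, 20 * 280e9)  # compute-optimal 280b run
optimal_175b = train_flops(175e9, 20 * 175e9)  # compute-optimal 175b run

print(optimal_280b / gopher)        # ~18.7 Gophers (paper's table says 17.2)
print(optimal_175b / gopher)        # ~7.3 Gophers (paper's table says 6.7)
print(gopher / optimal_280b)        # ~0.05: Gopher got ~5-6% of optimal compute
print(optimal_280b / optimal_175b)  # ~2.6: "almost 3" 175b runs per finished Gopher
```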
I don’t see why the conclusion follows from your argument. I assume you are right that they’d need to train Gopher with ~17× its original compute (ie. that many more training steps) to reach the optimal level for a 280b-parameter model, and that instead they could train 3 different optimal 175b-parameter models. But… maybe they would rather have the former than the latter? If I were in charge, I’d rather have 1 ‘finished’ 280b model than 3 finished 175b models.
The existing Gopher is a sunk cost. Imagine throwing it away, and an intern reporting that tweaking a different hyperparameter would save 6% of FLOPS, but only on models at or past 280b. Would you suddenly go “this changes everything!”? Or would you instead say, “yes, good job, but 280b models are very expensive, and there are countless interesting things we could do with 3 175b models trained from scratch, such as multilingual, different-modality, or multimodal work, and even more things we could do with another 17 Chinchillas trained from scratch”? If the existing Gopher represents only 6% of the optimal compute budget, then saving that 6% is unlikely to move the needle on any decision.
Ha, good point. But still, don’t people want to have bigass text models? The bigger the better? The 6% savings is just a cherry on top. It sounds like you don’t; you’d rather have 3 175bs?
If you just want a big parameter-count to wave around, you use a MoE like everyone else optimizing for clickbait. (Or even better, use n-grams so you can talk about having hundreds of trillions of parameters. It’ll work awfully compared to NNs, but you’ll have the most parameters!)
A 280b model is nice, but I would definitely trade it for 3 175bs, assuming something interesting was done with them. For example, I would happily trade a fully-trained text Gopher for a Github 175b, a multilingual text 175b, and a DALL-E/Cogview/Make-A-Scene 175b (trained on text+VAE-tokens), say. (Or a Decision Transformer trained on All The Game/Robot/RL Logged Data™, or...)
On that note, did you see that recent Chinese MoE with 174T params, 3 layers, and 96000 experts?
I saw that it was just a tech demo (like DeepSpeed training 1t-dense models for a few steps) and put it on my reading-list. https://www.gwern.net/docs/ai/scaling/moe/2022-01-26-eyeonai-tangjiewudaointerview.pdf suggests they’re serious about using supercomputer-scale compute eventually, but they haven’t done so yet, nor invested as much compute as Baidu did with ERNIE Titan, so it’s not a major priority compared to trying to read all the papers on trained models…* (One reason I am skeptical of MoEs is that, for all the Chinese investment in them, nobody seems to report much interesting output from the models, while it seems like anyone who tinkers with the largest dense models stumbles over something like inner-monologues. Do their users have a terminal lack of imagination, is none of it getting translated or included in the papers, or are MoEs just not that great?)
* Even before Chinchilla, it was obvious that training a 1t, much less 100t, dense model to converged/compute-optimal performance is far harder than demonstrating you can train such a model for a step or two. Similarly for MoEs: if you can train a 100t-parameter MoE to converged/compute-optimal performance, my suspicion is that you probably shouldn’t’ve bothered in the first place, because if a 100t MoE trainable with a contemporary amount of FLOPS is the answer, then the question must be a bad one.