The existing Gopher is a sunk cost. Imagine throwing it away and an intern reporting that some tweaks to a different hyperparameter would save 6% of FLOPS, but only on models at or past 280b. Would you suddenly go “this changes everything!”? Or would you instead say, “yes, good job, but 280b models are very expensive, and there are countless interesting things we can do with 3 175b models trained from scratch, such as multilingual, different-modality, or multimodal work, and there are even more things we could do with another 17 Chinchillas trained from scratch”? If the gain is only 6%, it’s unlikely to move the needle on any decisions.
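To put the 6% in rough absolute terms, a back-of-the-envelope sketch, assuming the standard C ≈ 6·N·D training-FLOPs approximation and Gopher’s published 280b-parameter/300b-token run (the 10b comparison run below is purely illustrative, not from the conversation):

```python
# Back-of-the-envelope training-compute sketch, using the standard C ≈ 6·N·D
# approximation. Gopher's 280b parameters / 300b tokens are published figures;
# the 10b comparison run is purely illustrative.
def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute in FLOPs."""
    return 6 * params * tokens

gopher  = train_flops(280e9, 300e9)   # ≈ 5.0e23 FLOPs
saving  = 0.06 * gopher               # ≈ 3.0e22 FLOPs freed by the 6% tweak
small   = train_flops(10e9, 500e9)    # ≈ 3.0e22 FLOPs: a ~10b model on 500b tokens
big_175 = train_flops(175e9, 300e9)   # ≈ 3.2e23 FLOPs: one more 175b at Gopher-like tokens

print(f"Gopher budget ≈ {gopher:.1e} FLOPs")
print(f"6% saving     ≈ {saving:.1e} FLOPs (about a 10b/500b-token run)")
print(f"175b run      ≈ {big_175:.1e} FLOPs, ~{big_175 / saving:.0f}x the saving")
```

Real compute, but an order of magnitude short of paying for even one more 175b run.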
Ha, good point. But still though—don’t people want to have bigass text models? The bigger the better? The 6% savings is just a cherry on top. It sounds like you don’t; you’d rather have 3 175b’s?
If you just want a big parameter-count to wave around, you use a MoE like everyone else optimizing for clickbait. (Or even better, use n-grams so you can talk about having hundreds of trillions of parameters. It’ll work awfully compared to NNs, but you’ll have the most parameters!)
A 280b model is nice, but I would definitely trade it for 3 175bs, assuming something interesting was done with them. For example, I would happily trade a fully-trained text Gopher for a GitHub 175b, a multilingual text 175b, and a DALL-E/CogView/Make-A-Scene 175b (trained on text+VAE-tokens), say. (Or a Decision Transformer trained on All The Game/Robot/RL Logged Data™, or...)
On that note, did you see that recent Chinese MoE with 174T params, 3 layers, and 96000 experts?
I saw that it was just a tech demo (like DeepSpeed training 1t-dense models for a few steps), and put it on my reading-list. https://www.gwern.net/docs/ai/scaling/moe/2022-01-26-eyeonai-tangjiewudaointerview.pdf suggests they’re serious about using supercomputer-scale computers, but it looks like they haven’t done so yet, nor invested as much compute as Baidu did with ERNIE Titan, so it’s not a major priority compared to trying to read all the papers on trained models...* (One reason I am skeptical of MoEs is that for all the Chinese investment into them, nobody seems to report much interesting output from the models, while it seems like anyone who tinkers with the largest dense models will stumble over something like inner-monologues. Do their users show a terminal lack of imagination, are none of the interesting results getting translated or included in the papers, or are MoEs just not that great?)
* Even before Chinchilla, it was obvious that training a 1t, much less a 100t, dense model to converged/compute-optimal performance is far harder than demonstrating you can train such a model for a step or two. Similarly for MoEs: if you can train a 100t-parameter MoE to converged/compute-optimal performance, my suspicion is that you probably shouldn’t’ve bothered in the first place, because if a 100t MoE trainable with a contemporary amount of FLOPS is the answer, then the question must be a bad one.
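To put rough numbers on “far harder”, a sketch assuming the Chinchilla-style heuristic of ~20 training tokens per parameter and the same C ≈ 6·N·D approximation (extrapolating the fitted scaling law that far out is itself dubious):

```python
# Rough compute-optimal training costs for dense models, assuming the
# Chinchilla-style heuristic of ~20 training tokens per parameter and the
# C ≈ 6·N·D FLOPs approximation (an extrapolation well past the fitted range).
def compute_optimal(params: float, tokens_per_param: float = 20.0):
    tokens = tokens_per_param * params
    return tokens, 6 * params * tokens

for n in (70e9, 1e12, 100e12):        # Chinchilla-scale, 1t dense, 100t dense
    tokens, flops = compute_optimal(n)
    print(f"{n:9.0e} params -> {tokens:.1e} tokens, {flops:.1e} FLOPs")

# 70b  -> ~1.4e12 tokens, ~5.9e23 FLOPs  (≈ Chinchilla itself)
# 1t   -> ~2.0e13 tokens, ~1.2e26 FLOPs  (~200× a Chinchilla)
# 100t -> ~2.0e15 tokens, ~1.2e30 FLOPs  (~2,000,000× a Chinchilla, and more
#         tokens than any existing text corpus); a few demo steps prove little.
```

Even granting MoEs a large effective-compute discount, the 100t regime sits many orders of magnitude beyond any contemporary FLOPS budget, which is the point of the footnote.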