I’m wondering: could one just continue training Gopher (the previous bigger model) on the newly added data?
Unlikely, because Gopher is so far from what they find optimal. See the table of requirements which helpfully defines compute requirements in terms of “Gophers” (perhaps they were thinking much the same thing). An optimal 280b-parameter model (ie. a Gopher) requires 17.2 Gophers’ worth of compute; to put it another way, Gopher used only ~6% of the compute it should’ve for it to have been an optimal model. You could train almost 3 different 175-billion-parameter models from scratch for what it would take to ‘finish’ Gopher (an optimal 175b costs 6.7× Gopher’s actual compute).
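To spell out the arithmetic (the 17.2× and 6.7× multipliers are from the paper’s table; everything below is just a rough back-of-the-envelope on them):

```python
# Back-of-the-envelope in the table's "Gophers" unit
# (1 Gopher = the compute actually spent training Gopher).
optimal_280b = 17.2   # compute for a compute-optimal 280b model, in Gophers
optimal_175b = 6.7    # compute for a compute-optimal 175b model, in Gophers
already_spent = 1.0   # Gopher itself

fraction_done = already_spent / optimal_280b
remaining = optimal_280b - already_spent

print(f"Gopher got {fraction_done:.1%} of the compute an optimal 280b needs")  # ~5.8%
print(f"'Finishing' it costs ~{remaining:.1f} more Gophers")                   # ~16.2
print(f"which is ~{remaining / optimal_175b:.1f} compute-optimal 175b runs from scratch")  # ~2.4
```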
I don’t see why the conclusion follows from your argument. I assume you are right that they’d need to keep training Gopher for 17.2× more training steps in order to reach the optimal level for a 280b-parameter model, and that instead they could train 3 different optimal 175b-parameter models. But… maybe they would rather have the former than the latter? If I were in charge, I’d rather have 1 ‘finished’ 280b than 3 finished 175b models.
The existing Gopher is a sunk cost. Imagine throwing it away, and an intern reporting that some tweaks to a different hyperparameter would save 6% of FLOPS, but only on models at or past 280b. Would you suddenly go “this changes everything!”? Or would you instead say, “yes, good job, but 280b models are very expensive, and there are countless interesting things we can do with 3 175b models trained from scratch, such as multilingual, different-modality, or multimodal work, and there are even more things we could do with another 17 Chinchillas trained from scratch”? If you are only 6% of the way there, it’s unlikely that saving 6% is going to move the needle on any decisions.
Ha, good point. But still though—don’t people want to have bigass text models? The bigger the better? The 6% savings is just a cherry on top. It sounds like you don’t; you’d rather have 3 175b’s?
If you just want a big parameter-count to wave around, you use a MoE like everyone else optimizing for clickbait. (Or even better, use n-grams so you can talk about having hundreds of trillions of parameters. It’ll work awfully compared to NNs, but you’ll have the most parameters!)
A 280b model is nice, but I would definitely trade it for 3 175bs, assuming something interesting was done with them. For example, I would happily trade a fully-trained text Gopher for a Github 175b, a multilingual text 175b, and a DALL-E/Cogview/Make-A-Scene 175b (trained on text+VAE-tokens), say. (Or a Decision Transformer trained on All The Game/Robot/RL Logged Data™, or...)
On that note, did you see that recent Chinese MoE with 174T params, 3 layers, and 96000 experts?
I saw that it was just a tech demo (like DeepSpeed training 1t-dense models for a few steps), and put it on my reading-list. https://www.gwern.net/docs/ai/scaling/moe/2022-01-26-eyeonai-tangjiewudaointerview.pdf suggests they’re serious about using supercomputer-scale computers, but they haven’t done so yet or invested as much compute as Baidu did with ERNIE Titan, so it’s not a major priority compared to trying to read all the papers on trained models...* (One reason I am skeptical of MoEs is that, for all the Chinese investment in them, nobody seems to report much interesting output from the models, while it seems like anyone who tinkers with the largest dense models will stumble over something like inner-monologues. Do their users show a terminal lack of imagination, are none of them getting translated or included in the papers, or are MoEs just not that great?)
* Even before Chinchilla, it was obvious that training a 1t, much less 100t, dense model to converged/compute-optimal performance is far harder than demonstrating you can train such a model for a step or two. Similarly for MoEs: if you can train a 100t-parameter MoE to converged/compute-optimal, my suspicion is that you probably shouldn’t’ve bothered in the first place, because if a 100t MoE trainable with a contemporary amount of FLOPS is the answer, then the question must be a bad one.
Probably, right? They might have to change the hyperparameters, e.g. the learning rate schedule.
I’d imagine they are already doing this.
I would also say “probably”.
But it’s not totally clear. In my experience using a suboptimal learning rate sometimes seems to put the model on the wrong kind of trajectory, i.e. you can’t necessarily switch to the “correct” learning rate and still get the same performance as if you’d used the correct schedule from the beginning.
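A toy illustration of what I mean, assuming a plain cosine decay (the step counts and learning rates here are invented for the example, not anyone’s actual config):

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=3e-5):
    """Cosine decay from lr_max to lr_min over total_steps (warmup omitted)."""
    frac = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * frac))

# Schedule A: decay planned for the full 300k-step run from the beginning.
# Schedule B: decay planned for a 100k-step run, then training is "extended"
# to 300k by switching over to A's schedule mid-run.
for step in (50_000, 100_000, 200_000):
    lr_a = cosine_lr(step, 300_000)
    lr_b = cosine_lr(step, 100_000) if step <= 100_000 else cosine_lr(step, 300_000)
    print(f"step {step:>7,}: planned-for-300k {lr_a:.2e}  vs  extended {lr_b:.2e}")

# By step 100k, B has already annealed down to ~lr_min while A is still hot;
# jumping B back up to A's values afterwards is a different trajectory than
# having followed A's schedule from the start, even though the later LR values match.
```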
But, I don’t really understand this from the abstract alone. I thought the Kaplan scaling laws were based on single-epoch training? With minimal upsampling of some parts of the training data at most? How do you then get suboptimal scaling laws based on not using enough data?
Must have been different I suppose.
It was single-epoch in the sense that they didn’t even do 1 full pass over all their data: they only trained on a subsample of their full Internet text dataset (you can see the ratios in the papers somewhere). But even if they had trained exactly once on every token, with none of the oversampling/undersampling business, there’s no reason to expect their one dataset to be exactly the right size for every possible model size, regardless of what the scaling may be. Turns out, that fixed amount was much too small for the smaller models, and maybe too large for the largest models. (Although even with the Kaplan law, people were undertraining models and getting half-baked results—look at Megatron-NLG.)
Don’t you mean the dataset size was much too large for the smaller models and maybe too small for the largest models?
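For reference, under the rough ~20-tokens-per-parameter reading of Chinchilla (a popular approximation, not the paper’s exact fitted law; the 300b-token budget below is a GPT-3-style stand-in), a fixed dataset lines up with only one model size:

```python
# Sketch: compare a fixed token budget against the ~20 tokens/parameter heuristic.
fixed_dataset = 300e9  # tokens; stand-in for a fixed GPT-3-style budget (assumption)

for params in (1e9, 13e9, 175e9, 280e9):
    optimal_tokens = 20 * params            # heuristic, not the paper's fitted law
    ratio = fixed_dataset / optimal_tokens
    print(f"{params/1e9:>5.0f}b params: optimal ≈ {optimal_tokens/1e9:,.0f}b tokens; "
          f"the fixed 300b is {ratio:.2f}x that")
# 1b params:   optimal ≈ 20b tokens;    the fixed 300b is 15x that
# 175b params: optimal ≈ 3,500b tokens; the fixed 300b is ~0.09x that
```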