New LM scaling paper from DeepMind (abs, pdf).
Abstract (my emphasis):
We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.
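To make the headline rule concrete: under the usual approximation that training FLOPs C ≈ 6·N·D, scaling N and D equally means each grows as √C. Here is a quick back-of-the-envelope sketch (mine, not the paper's code; the ~20 tokens-per-parameter ratio is the rough optimum implied by Chinchilla itself, 70B parameters on 1.4T tokens):

```python
# Rough compute-optimal allocation under the "scale N and D equally" rule.
# Assumptions (mine, not the paper's code): training FLOPs C ~ 6 * N * D,
# and roughly ~20 training tokens per parameter at the compute-optimal point.

def compute_optimal(c_flops, tokens_per_param=20.0):
    """Split a compute budget into parameters N and training tokens D,
    assuming C = 6 * N * D and D = tokens_per_param * N."""
    n = (c_flops / (6.0 * tokens_per_param)) ** 0.5
    d = tokens_per_param * n
    return n, d

# Example: a roughly Chinchilla-sized budget (~5.8e23 FLOPs).
n, d = compute_optimal(5.8e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")  # ~7e10 params, ~1.4e12 tokens

# Doubling the budget multiplies both N and D by sqrt(2): they scale equally.
```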
Brief comments on my blog here.
Presumably has implications for Bio Anchors?
The first-order implication for Bio Anchors is that the number of training datapoints appears to scale linearly with parameter count, rather than in proportion to parameter count^0.8, as estimated in the report. So for example, if you think that TAI models will be 100,000 times larger than current models, then they’ll need 10 times more compute to train than was previously estimated. This pushes out timelines on the order of a few years, to the extent that you put weight on the neural network model.
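Spelling out that factor of 10 (a back-of-the-envelope sketch; the only inputs are the two scaling exponents and the hypothetical 100,000x size increase):

```python
# Back-of-the-envelope: training compute scales roughly as N * D.
# Old assumption: D ~ N^0.8 (Bio Anchors report); new: D ~ N (this paper).

scale_up = 1e5  # hypothetical TAI model 100,000x larger than current models

compute_old = scale_up * scale_up ** 0.8  # N * D with D ~ N^0.8
compute_new = scale_up * scale_up ** 1.0  # N * D with D ~ N

print(compute_new / compute_old)  # = scale_up ** 0.2 = 10.0
```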
Overall I guess this should shorten timelines, because the effect you explain here is counteracted by the other first-order effect of “oh geez it looks like our earlier scaling projections were inefficient; for any performance level, we now know how to reach that level for less compute cost than the earlier projections said.” What do you think?
It ought to shorten actual timelines, for the reason you say. (Except insofar as data sourcing could actually become a practical problem.)
However, it lengthens the Bio Anchors timeline, because the parameter count in Bio Anchors is fixed. (It’s the parameter count of a model that uses about as much inference compute as the brain.)
This is a weird thing about Bio Anchors—it asks when models will cross a threshold for the compute required to run them, so efficiency improvements of various kinds will lengthen its timeline. It’s always waiting for its “sufficiently expensive model” (and it does not care that this model keeps “getting better” in terms of loss/etc as the efficiency improvements roll in).
Anyway, I’d forgotten the prior used for dataset scaling in Bio Anchors, but it’s pretty broad (page 39 of part 2), with substantial mass on linear/super-linear scaling. So this news is less relevant than I had thought.
I suppose that depends on whether you think this constitutes several years of progress over and above what you would have expected. I don’t think this comes close to that, so I think the effect is much smaller.
OK, good to know. I look forward to seeing the performance trends updated with the new scaling paradigm/law.
(In terms of the neural network model, this means lowering our estimate for how many parameters will be needed.)
Something worth reemphasizing for folks not in the field is that these benchmarks are not like the usual benchmarks where you train the model on the task and then see how well it does on a held-out set. Chinchilla was not explicitly trained on any of these problems. It’s typically given some context like: “Q: What is the southernmost continent? A: Antarctica Q: What is the continent north of Africa? A:” and then simply completes the prompt until a stop token is emitted, like a newline character.
And it’s performing above-average-human on these benchmarks.
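Concretely, the evaluation loop is just something like this (a minimal sketch; `model_generate` is a hypothetical stand-in for whatever sampling interface the model exposes, not anything from the paper):

```python
# Sketch of few-shot evaluation by prompt completion.
# `model_generate` is a hypothetical stand-in for the model's sampling API.

FEW_SHOT_PROMPT = (
    "Q: What is the southernmost continent?\n"
    "A: Antarctica\n"
    "Q: What is the continent north of Africa?\n"
    "A:"
)

def answer(model_generate, prompt=FEW_SHOT_PROMPT, stop="\n", max_steps=64):
    """Extend the prompt until the stop string appears (or we give up),
    then return the text before the stop string as the model's answer."""
    completion = ""
    for _ in range(max_steps):
        completion += model_generate(prompt + completion)
        if stop in completion:
            break
    return completion.split(stop)[0].strip()

# The answer is then compared against the reference label; at no point
# is the model fine-tuned on the benchmark itself.
```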
Thinking back to the “inconsistency” from the Kaplan et al papers...
In Appendix E of the new paper, we see the loss-vs-compute frontier start to “bend” from a straight line on a log-log plot, with returns to additional compute getting smaller at large scales.
I suspect this bending is the transition from the faster “L(C) law” to the slower “L(D) law.”
A brief recap of that below:
Adding more params can help in two ways: it makes your model’s loss decline toward its asymptotic minimum faster, and it can lower that minimum itself.
As models get bigger, the first effect dies off: the loss curves converge to a fixed shape, rather than getting ever steeper. The second effect keeps going, but with it alone, the overall rate of return is lower.
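For reference, the new paper fits a parametric form that separates these two effects into a size-limited term and a data-limited term. A sketch below, with constants roughly as reported in the paper’s fit (treat the exact numbers as approximate):

```python
# Parametric fit from the new paper: L(N, D) = E + A / N**alpha + B / D**beta.
# Constants are approximately the reported values; treat them as illustrative.

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, d_tokens):
    """Predicted pretraining loss for n_params parameters on d_tokens tokens."""
    return E + A / n_params**ALPHA + B / d_tokens**BETA

# Holding N fixed, more data only shrinks the B / D**beta term toward the
# asymptote E + A / N**alpha; only more parameters can lower that asymptote.
print(loss(70e9, 1.4e12))   # roughly Chinchilla-scale
print(loss(280e9, 300e9))   # roughly Gopher-scale (similar compute, higher loss)
```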
Presumably, the learning rate issue in Kaplan et al. also affected their estimated L(D) law.
The issue made Kaplan et al underestimate optimal model performance. The underestimate was worst when considering models for which the optimal number of training steps was small.
The L(D) law came from early stopping experiments. The early stopping step is lower for smaller data sizes.
So the L(D) experiments with smaller D values look artificially bad, relative to the ones with large D values. Thus the estimated L(D) curve declines faster than the true L(D) curve.
If this is correct, then L(D) improves more slowly with data than we had believed.
Note that this does not contradict the “use more data!” result from the paper; that result is about the relative rate at which N and D affect L(N, D).
I’m wondering: could one just continue training Gopher (the previous bigger model) on the newly added data?
Unlikely, because Gopher is so far from what they find optimal. See the table of requirements, which helpfully defines compute requirements in terms of “Gophers” (perhaps they were thinking much the same thing). An optimal 280B-parameter model (i.e. a Gopher-sized model) requires 17.2 Gophers’ worth of compute; to put it another way, Gopher used only ~6% of the compute it should have for it to have been a compute-optimal model. You could train almost 3 different 175B-parameter models from scratch for what it would take to ‘finish’ Gopher (each of them costs 6.7x Gopher).
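The arithmetic behind those numbers, using just the ratios quoted above:

```python
# The table expresses compute in "Gophers" (1 Gopher = the compute actually
# spent on Gopher). The ratios below are the ones quoted above.

optimal_280b = 17.2  # compute-optimal 280B model, in Gophers
optimal_175b = 6.7   # compute-optimal 175B model, in Gophers

print(1.0 / optimal_280b)           # ~0.06: Gopher got ~6% of its optimal compute
print(optimal_280b / optimal_175b)  # ~2.6 optimal 175B runs per optimal 280B run
```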
I don’t see why the conclusion follows from your argument. I assume you are right about how they’d need to keep training Gopher for 17.2X more training steps in order to reach optimal level for 280b-parameter models. Instead they could train 3 different optimal 175b-parameter models. But… maybe they would rather have the former than the latter? If I were in charge, I’d rather have 1 ‘finished’ 280b than 3 finished 175b models.
The existing Gopher is a sunk cost. Imagine throwing it away and an intern reporting that some tweaks to a different hyperparameter would save 6% FLOPS but only on models at or past 280b. Would you suddenly go “this changes everything!” Or would you instead say, “yes, good job, but 280b models are very expensive, and there are countless interesting things we can do with 3 175b models trained from scratch, such as doing multilingual or different modalities or multimodal work, and there are even more things we could do with another 17 Chinchillas trained from scratch”? If you are only 6% of the way, then it’s unlikely saving 6% is going to move the needle on any decisions.
Ha, good point. But still though—don’t people want to have bigass text models? The bigger the better? The 6% savings is just a cherry on top. It sounds like you don’t; you’d rather have 3 175b’s?
If you just want a big parameter-count to wave around, you use a MoE like everyone else optimizing for clickbait. (Or even better, use n-grams so you can talk about having hundreds of trillions of parameters. It’ll work awful compared to NNs, but you’ll have the most parameters!)
A 280b model is nice, but I would definitely trade it for 3 175bs, assuming something interesting was done with them. For example, I would happily trade a fully-trained text Gopher for a Github 175b, a multilingual text 175b, and a DALL-E/Cogview/Make-A-Scene 175b (trained on text+VAE-tokens), say. (Or a Decision Transformer trained on All The Game/Robot/RL Logged Data™, or...)
On that note, did you see that recent Chinese MoE with 174T params, 3 layers, and 96000 experts?
I saw that it was just a tech demo (like DeepSpeed training 1t-dense models for a few steps), and put it on my reading list. https://www.gwern.net/docs/ai/scaling/moe/2022-01-26-eyeonai-tangjiewudaointerview.pdf suggests they’re serious about using supercomputer-scale systems, but it looks like they haven’t done so yet or invested as much compute as Baidu did with ERNIE Titan, so it’s not a major priority compared to trying to read all the papers on trained models...* (One reason I am skeptical of MoEs is that for all the Chinese investment into them, nobody seems to report much interesting output from the models, while it seems like anyone who tinkers with the largest dense models will stumble over something like inner-monologues. Do their users show a terminal lack of imagination, are just none of them at all getting translated or included in the papers, or are MoEs just not that great?)
* Even before Chinchilla, it was obvious that training a 1t, much less 100t, dense model to converged/compute-optimal performance, is far harder than demonstrating you can train such a model for a step or two. Similarly for MoEs: if you can train a 100t-parameter MoE to converged/compute-optimal, my suspicion is that you probably shouldn’t’ve bothered in the first place because if a 100t MoE trainable with a contemporary amount of FLOPS is the answer, then the question must be a bad one.
Probably, right? They might have to change the hyperparameters e.g. the learning rate schedule.
I’d imagine they are already doing this.
I would also say “probably”.
But it’s not totally clear. In my experience using a suboptimal learning rate sometimes seems to put the model on the wrong kind of trajectory, i.e. you can’t necessarily switch to the “correct” learning rate and still get the same performance as if you’d used the correct schedule from the beginning.
But, I don’t really understand this from the abstract alone. I thought the Kaplan scaling laws were based on single-epoch training? With minimal upsampling of some parts of the training data at most? How do you then get suboptimal scaling laws based on not using enough data?
Must have been different I suppose.
It was single-epoch in the sense that they didn’t even do one full pass over all their data; they only trained on a subsample of their full Internet text dataset (you can see the ratios in the papers somewhere). But even if they had trained exactly once on every token with none of the oversampling/undersampling business, there’s no reason to expect their 1 dataset to be exactly the right size for every possible model size, regardless of what the scaling may be. Turns out, that fixed amount was much too small for the smaller models, and maybe too large for the largest models. (Although even with the Kaplan law, people were undertraining models and getting half-baked results; look at Megatron-Turing NLG.)
Don’t you mean the dataset size was much too large for the smaller models and maybe too small for the largest models?
Do they mention context window size somewhere? I couldn’t find it. Though in principle I think it could be computed from the other hyperparameters they list.
It’s strongly implied to be 2048, as in Gopher.
This implies that optimal training of Gopher should have used 16x the data and compute.
It also implies that, for a while, further scaling will come from more compute and data rather than more parameters.
All the nice graphs will now get an ugly kink.
All the extrapolations to the human (neocortex) neuron count are off.
Really looking forward to reading the paper.
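A quick back-of-the-envelope check on the “~16x” figure above (assuming Gopher’s ~300B training tokens and the rough ~20 tokens-per-parameter optimum implied by the new paper; a sketch, not the paper’s own numbers):

```python
# Sanity check on the "~16x" figure, assuming Gopher was trained on ~300B
# tokens and that the compute-optimal ratio is roughly ~20 tokens/parameter.

params = 280e9
tokens_used = 300e9
tokens_optimal = 20 * params            # ~5.6e12 tokens

print(tokens_optimal / tokens_used)     # ~19x more data; with N fixed, training
                                        # compute scales by the same factor
```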