How come PaLM_opt is smaller than Chinchilla? Isn’t Chinchilla supposed to be Gopher_opt?
See the footnote attached to that sentence.
These models were trained differently, which is why they had different scaling laws. Can we suppose that the new scaling laws tell us where the old scaling would have broken down?
Great question, with a complicated answer.
First, one of the assumptions you’re making is not quite right. By “trained differently” I imagine you’re referring to a difference in learning rate schedules, since that was the fundamental difference between the earlier scaling papers (Kaplan et al) and the Chinchilla paper (Hoffmann et al).
Then, it sounds like you’re imagining:
1. Kaplan et al chose learning rate schedules in a particular way
2. Models like GPT-3 and Gopher did learning rate schedules in the same way, so they got the same scaling law
3. Hoffmann et al chose their learning rate schedules in a different way from previous authors, so they got a different scaling law
But (2) here is not true. Kaplan et al chose their schedules in an unusual way that doesn’t adapt to the number of training steps, while in practice (and in GPT-3, etc.) people always adapt their schedules to the number of steps like Hoffmann et al do.
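To make the distinction concrete, here is a minimal sketch (hyperparameter values made up, and cosine decay used as a stand-in for both papers' actual schedules, which differ in detail): the standard approach sets the decay horizon to the length of the run, while a Kaplan-et-al-style schedule fixes one horizon regardless of how long you actually train, so a shorter run ends with its LR still high.

```python
import math

def cosine_lr(step, decay_steps, lr_max=3e-4, lr_min_ratio=0.1):
    """Cosine decay from lr_max down to lr_min_ratio * lr_max over decay_steps."""
    progress = min(step / decay_steps, 1.0)
    return lr_max * (lr_min_ratio + (1 - lr_min_ratio) * 0.5 * (1 + math.cos(math.pi * progress)))

run_length = 10_000  # steps actually trained (illustrative)

# Standard practice (and Hoffmann et al): the decay horizon matches the run length,
# so every run, short or long, finishes its schedule.
adapted = [cosine_lr(s, decay_steps=run_length) for s in range(run_length)]

# Kaplan-et-al-style: one fixed decay horizon regardless of run length,
# so this shorter run never finishes its decay.
fixed = [cosine_lr(s, decay_steps=100_000) for s in range(run_length)]

print(f"final LR, adapted schedule: {adapted[-1]:.1e}")  # ~3e-5, fully decayed
print(f"final LR, fixed schedule:   {fixed[-1]:.1e}")    # ~2.9e-4, barely decayed
```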
“Wait,” you say—“if that’s true, then shouldn’t GPT-3 and Gopher agree with the Hoffmann et al law, not the Kaplan et al law? Why didn’t those papers observe a breakdown in the Kaplan et al law?”
Well, one of the implications of the Kaplan et al law is that for compute-optimal training, you should spend basically all your marginal compute on larger models, while increasing the number of training tokens (batch size * steps) more slowly.
Following this rule, people kept training on ~300B tokens while raising N with compute. So when they plotted loss-vs.-compute, they were effectively just plotting loss-vs.-N.
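For a rough sense of the arithmetic, using the standard C ≈ 6·N·D approximation for training FLOPs (the parameter counts below are just illustrative):

```python
# With the token count D held fixed, compute C ≈ 6*N*D is just a rescaled parameter
# count, so a loss-vs.-compute plot is effectively a loss-vs.-N plot.
D = 300e9                            # ~300B tokens, the GPT-3 / Gopher-style budget
for N in [1e9, 13e9, 175e9, 280e9]:  # illustrative parameter counts
    C = 6 * N * D
    print(f"N = {N / 1e9:>5.0f}B params -> C ≈ {C:.1e} FLOPs (D fixed)")
```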
But if you’re just looking at loss-vs.-N for a constant number of training tokens, and that number is reasonably close to the one Kaplan et al used to set their LR schedule (so that your schedule is close to theirs), then the Kaplan et al law is a lot, uh, less wrong.
The problem with the Kaplan law was an incorrect estimate of how loss varies with steps/data, and, as a result, it picks param/step/data combinations that are suboptimal for a given compute budget.
But those suboptimal recommendations also tell you not to vary steps/data much. The law is wrong about what happens when you vary steps/data, but it also tells you not to do that, so you won’t notice it being wrong.
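To put rough numbers on ‘not much’, using the approximate published compute-optimal exponents (Kaplan et al: D ∝ C^0.27, N ∝ C^0.73; Hoffmann et al: both ∝ C^0.5):

```python
# How much each law says to grow the token count D as compute C grows,
# using the approximate published compute-optimal exponents.
KAPLAN_D_EXP = 0.27      # Kaplan et al: D grows as C^0.27 (N as C^0.73)
CHINCHILLA_D_EXP = 0.5   # Hoffmann et al: D grows as C^0.5 (N as C^0.5)

for growth in [10, 100, 1000]:
    print(f"compute x{growth:>4}: "
          f"Kaplan grows D x{growth ** KAPLAN_D_EXP:4.1f}, "
          f"Chinchilla grows D x{growth ** CHINCHILLA_D_EXP:5.1f}")
```

So even a 100× jump in compute only asks for about 3.5× more tokens under the Kaplan law, versus 10× under Chinchilla.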
Is the difference mostly the learning rate schedule? I read it was also AdamW, and it is at least conceivable that AdamW somehow gets better results for smaller models using more data but maxes out on the benefits of model size more quickly than plain Adam. So it could in theory be the case that scaling continues along the old scaling laws beyond what the new scaling laws say is possible, because Adam and AdamW just work differently enough. Of course that’s not very plausible, and for different learning rate schedules it is maybe even less plausible.
Another way to phrase the question: Are the old and the new scaling laws roughly compatible? I.e. do the old scaling laws drop out of the new scaling laws if you use the old compute-optimal data/params distribution? I interpret your answer as that being roughly the case for the current models, but maybe not when you extrapolate further along the old scaling laws?
If the old scaling laws are still correct for a fixed dataset with a correspondingly fixed learning rate schedule, then we can reasonably say that the new scaling laws show us where the old scaling would have hit a wall.
(That is, the ‘W’ in AdamW stands for ‘weight decay’: a ridge-like (L2) regularization that tries to shrink the size of the weights, reducing ‘how wiggly’ and complex the curves a given set of weights can compute, biasing towards smoother, simpler curves that require less data to estimate well. Per the famous bias/variance tradeoff, regularization can help with small data and hurt with large data, so with large-data approaches a key fix is often removing regularization—and these models are the largest of data. In principle, an informative prior like regularization ought to ‘wash out’ in the limit and not be a problem even if it is wrong, but in practice this doesn’t seem to quite work out, perhaps because these approaches aren’t Bayesian enough for that to happen, or because you have other bad hyperparameters, or something is not quite implemented right, or it is all correct in the limit but training dynamics go wrong… “Neural nets want to work”, so you can have pretty serious bugs and still have a NN which seems to be training as well as it should yet falls far short of its true potential.)
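For concreteness, here is a minimal sketch of the update-rule difference (bias correction omitted, hyperparameter values just common defaults, function names purely illustrative): plain Adam with an L2 penalty folds the decay term into the gradient before the adaptive rescaling, whereas AdamW applies the decay to the weights directly, decoupled from the gradient statistics, which is the Loshchilov & Hutter change.

```python
import numpy as np

def adam_l2_step(w, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """Adam with weight decay implemented as an L2 penalty on the loss."""
    g = g + wd * w                        # decay passes through the adaptive rescaling
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    w = w - lr * m / (np.sqrt(v) + eps)
    return w, m, v

def adamw_step(w, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """AdamW: the decay is applied to the weights directly, outside the rescaling."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    w = w - lr * (m / (np.sqrt(v) + eps) + wd * w)   # decoupled weight decay
    return w, m, v
```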
WD is not really about regularisation nowadays, so it’s not surprising that it helps at all model sizes. Layernorm in transformers makes WD mostly affect the effective LR of the weights: except for the final linear, the absolute scale of the weights doesn’t matter, since you have a final LN, so the actual effect of WD is keeping the update/weight ratio bigger over training. (In fact, in normed nets you can replace WD with an exponentially increasing LR schedule.)
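A quick toy check of the scale-invariance point (just a weight matrix feeding into a LayerNorm, not transformer code): shrinking the weights, as decay would, leaves the LN output unchanged, so WD isn’t shrinking the function being computed; what it changes is how large each update is relative to the weights.

```python
import numpy as np

def layernorm(x):
    # LayerNorm with gain=1, bias=0 (eps omitted so the scale-invariance is exact).
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / x.std(-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
W = rng.normal(size=(16, 16))

out_full   = layernorm(x @ W)
out_shrunk = layernorm(x @ (0.1 * W))     # shrink the weights, as weight decay would

print(np.allclose(out_full, out_shrunk))  # True: the LN output ignores the weights' scale
# Scale-invariance also means the gradient w.r.t. the shrunken weights is 10x larger,
# so the same optimizer step moves the (smaller) weights relatively further:
# the effective LR has gone up.
```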
Yes, that’s part of what I mean about regularization having weird effects and interactions in practice. If it were a Bayesian informative prior, which is the nice theoretical interpretation of penalized-regression stuff, you would not expect it to be equivalent to rescaling the LR, such that you discover you had in effect lowered the LR permanently or something, as opposed to it washing out & simply requiring you to spend more data to overcome your poor choice of prior. In a scaling-law context, you’d expect it to be a change in the constant, not the exponent or parameterization. (At least, it’s certainly not obvious to me that that’s what WD would be equivalent to, and if AdamW and weight decay worked like one assumed they did, the Hutter group wouldn’t have so many papers about fixing it.)
Has this unimportance of WD as regularization been written about somewhere? As a possible counterpoint, in a recent paper on the grokking phenomenon, the authors found that grokking only occurs when training with WD. Otherwise, once the model reached zero training loss, it would barely have a gradient to follow, and thus stop building better representations that improve prediction OOD.