Is the difference mostly the learning rate schedule? I read it was also AdamW, and it is at least conceivable that AdamW somehow gets better results for smaller models trained on more data but maxes out on the benefits of model size more quickly than plain Adam. So it could in theory be the case that scaling continues along the old scaling laws beyond what the new scaling laws say is possible, because Adam and AdamW just work differently enough. Of course that’s not very plausible, and if the difference is just the learning rate schedule it is maybe even less plausible.
Another way to phrase the question: are the old and the new scaling laws roughly compatible? That is, do the old scaling laws drop out of the new scaling laws if you use the old compute-optimal data/params allocation? I interpret your answer as saying that this is roughly the case for current models, but maybe not when you extrapolate further along the old scaling laws?
If the old scaling laws are still correct for a fixed dataset with a correspondingly fixed learning rate schedule, then we can reasonably say that the new scaling laws show us where the old scaling would have hit a wall.
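(To make the ‘compatible now, diverging on extrapolation’ picture concrete, here is a rough sketch comparing the two compute-optimal allocations, assuming the commonly cited exponents of roughly N ∝ C^0.73 for the old Kaplan-style laws versus N ∝ C^0.5 for the new Chinchilla-style ones; the anchor point and the usual C ≈ 6·N·D approximation are just illustrative assumptions, so only the relative divergence is meaningful.)

```python
# Sketch: how old (Kaplan-style) and new (Chinchilla-style) compute-optimal
# allocations diverge when extrapolated. Exponents are the commonly cited ones
# (params N ~ C^0.73 vs N ~ C^0.5); both curves are anchored to agree at a
# hypothetical reference point (C0, N0).
import numpy as np

C0, N0 = 1e21, 1e9                 # hypothetical anchor where both prescriptions agree
C = np.logspace(21, 27, 7)         # extrapolate over six orders of magnitude of compute

N_old = N0 * (C / C0) ** 0.73      # old laws: spend most extra compute on parameters
N_new = N0 * (C / C0) ** 0.50      # new laws: split extra compute between params and data
D_old = C / (6 * N_old)            # tokens implied by the usual C ~= 6*N*D approximation
D_new = C / (6 * N_new)

for c, no, nn, do, dn in zip(C, N_old, N_new, D_old, D_new):
    print(f"C={c:.0e}: N_old={no:.1e} N_new={nn:.1e} D_old={do:.1e} D_new={dn:.1e}")
```

Anchored to agree at one scale, the two prescriptions drift apart by orders of magnitude in both parameters and tokens as compute grows, which is where the ‘wall’ question becomes an empirical one.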
(That is, the ‘W’ in AdamW stands for ‘weight decay’: a ridge-like (L2) regularization that tries to shrink the size of the weights, reducing how ‘wiggly’ and complex the curves a given set of weights can compute, and biasing towards smoother, simpler curves which require less data to estimate well. Per the famous bias/variance tradeoff, regularization can help with small data and hurt with large data, so with large-data approaches a key fix is often removing regularization, and these models are the largest of data. In principle, an informative prior like regularization ought to ‘wash out’ in the limit and not be a problem even if it is wrong, but in practice this doesn’t seem to quite work out, perhaps because these approaches aren’t Bayesian enough for that to happen, or because you have other bad hyperparameters, or something is not quite implemented right, or it is all correct in the limit but the training dynamics go wrong… “Neural nets want to work”, so you can have pretty serious bugs and still have a NN which seems to be training as well as it should yet falls far short of its true potential.)
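(For anyone who wants the mechanical difference spelled out: plain Adam with an L2 penalty folds the decay term into the gradient, where it then passes through Adam’s adaptive rescaling, whereas AdamW applies the decay directly to the weights, outside the adaptive update. A stripped-down sketch of one step, not any particular library’s implementation:)

```python
# Minimal single-tensor Adam step, showing "L2 penalty folded into the gradient"
# (classic Adam + L2) versus "decoupled weight decay" (AdamW).
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              wd=0.01, decoupled=True):
    if not decoupled:
        grad = grad + wd * w            # L2 penalty: decay is rescaled by Adam's moments
    m = b1 * m + (1 - b1) * grad        # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad**2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1**t)             # bias corrections
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * wd * w             # AdamW: shrink weights directly, bypassing the
                                        # adaptive rescaling (the 'W' = weight decay)
    return w, m, v
```

The practical upshot is that in AdamW the decay is a plain multiplicative shrink of every weight each step, regardless of how large or small its recent gradients have been.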
WD is not really about regularisation nowadays, so it’s not surprising that it helps at all model sizes. LayerNorm in transformers makes WD affect mostly the effective LR of the weights: except for the final linear layer, the absolute scale of the weights doesn’t matter, since there is a final LN, so the actual effect of WD is keeping the update-to-weight ratio bigger over training. (In fact, in normed nets you can substitute an exponentially increasing LR schedule for WD.)
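(A quick numerical check of the scale-invariance point, assuming a block where the weight matrix feeds straight into a LayerNorm as described above; the shapes and values are arbitrary toy choices:)

```python
# Toy check: a linear map followed by LayerNorm is invariant to rescaling the
# weight matrix, so weight decay on W cannot change the function computed, only
# the update-to-weight ratio (the effective LR) of those weights.
import numpy as np

def layernorm(z, eps=1e-5):
    mu = z.mean(axis=-1, keepdims=True)
    sigma = z.std(axis=-1, keepdims=True)
    return (z - mu) / (sigma + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))             # arbitrary batch of activations
W = rng.normal(size=(16, 16))            # arbitrary weight matrix

out = layernorm(x @ W)
out_scaled = layernorm(x @ (10.0 * W))   # rescale W by any positive factor

print(np.abs(out - out_scaled).max())    # ~1e-6: identical up to the LN epsilon
```

Because the function is unchanged under rescaling of W, the only lever WD has on such weights is how large each update is relative to the current weight norm.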
Yes, that’s part of what I mean about regularization having weird effects and interactions in practice. If it were a Bayesian informative prior, which is the nice theoretical interpretation of penalized-regression stuff, you would not expect it to be equivalent to rescaling the LR, such that you discover you had in effect lowered the LR permanently or something, as opposed to washing out & simply requiring you to spend more data to overcome your poor choice of prior. In a scaling-law context, you’d expect it to be a change in the constant, not the exponent or the parameterization. (At least, it’s certainly not obvious to me that that’s what WD would be equivalent to, and if AdamW and weight decay worked the way one assumed they did, the Hutter group wouldn’t have so many papers about fixing it.)
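(To spell out the ‘constant, not exponent’ intuition with a toy power-law fit L(D) = B·D^(−β) + E: a prior that only inflates the prefactor B costs a fixed multiplicative factor of extra data at any target loss, whereas a changed exponent β costs a factor that keeps growing as the target gets lower. All numbers below are made up for illustration.)

```python
# Schematic: with a power-law loss L(D) = B * D**-beta + E, a bad prior that only
# inflates B costs a constant factor of extra data; a shifted exponent costs a
# factor that grows without bound as the target (reducible) loss shrinks.
def data_needed(target_reducible, B, beta):
    """Invert L - E = B * D**-beta for D."""
    return (B / target_reducible) ** (1.0 / beta)

for t in [1.0, 0.3, 0.1, 0.03]:                  # reducible-loss targets (made up)
    d_base = data_needed(t, B=1e3, beta=0.30)    # hypothetical baseline fit
    d_pref = data_needed(t, B=3e3, beta=0.30)    # prior shows up in the constant
    d_expo = data_needed(t, B=1e3, beta=0.25)    # prior shows up in the exponent
    print(f"target={t:<5}: extra data needed, constant change x{d_pref/d_base:.0f}, "
          f"exponent change x{d_expo/d_base:.0f}")
```

In the first case you just ‘spend more data’ by a fixed multiple to overcome the prior; in the second the penalty compounds, which is the kind of change to the fitted law that a well-behaved prior shouldn’t produce.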
Has this view of WD as unimportant for regularization been written up somewhere? As a possible counterpoint, in a recent paper on the grokking phenomenon, the authors found that grokking only occurs when training with WD: otherwise, once the model reached zero training loss, it would barely have a gradient to follow, and thus stop building the better representations that improve prediction OOD.