(That is, the ‘W’ in AdamW stands for ‘weight decay’: a ridge-like (L2-style) regularization that shrinks the weights, reducing how ‘wiggly’ and complex the curves a given set of weights can compute are, and biasing towards smoother, simpler curves that require less data to estimate well. Per the famous bias/variance tradeoff, regularization can help with small data and hurt with large data, so in large-data approaches a key fix is often removing regularization, and these models are trained on the largest datasets of all. In principle, an informative prior like regularization ought to ‘wash out’ in the limit and not be a problem even if it is wrong, but in practice this doesn’t seem to quite work out, perhaps because these approaches aren’t Bayesian enough for that to happen, or because you have other bad hyperparameters, or something is not quite implemented right, or it is all correct in the limit but the training dynamics go wrong… “Neural nets want to work”, so you can have pretty serious bugs and still have a NN which seems to be training as well as it should and yet falls far short of its true potential.)
WD is not really about regularisation nowadays, so it’s not surprising that it helps at all model sizes. LayerNorm in transformers means WD mostly affects the effective LR of the weights: except for the final linear layer, the absolute scale of the weights doesn’t matter, since you have a final LN. So the actual effect of WD is to keep the update/weight ratio bigger over training. (In fact, in normed nets you can replace WD with an exponentially increasing LR schedule.)
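A minimal numerical sketch of the scale-invariance point, assuming a toy setup that is not from the thread: a single linear layer followed by a parameter-free layer norm and a squared-error loss. The names `layer_norm`, `loss`, and `numerical_grad` are hypothetical helpers. If the claim holds, multiplying the weights by `c` should leave the loss unchanged, shrink the gradient by `1/c`, and so shrink the relative update (step size over weight norm) by roughly `1/c**2` at a fixed LR.

```python
import numpy as np

def layer_norm(z):
    # Parameter-free layer norm over the last axis (no epsilon, so it is exactly scale-invariant).
    mu = z.mean(axis=-1, keepdims=True)
    sd = z.std(axis=-1, keepdims=True)
    return (z - mu) / sd

def loss(W, x, target):
    # Toy objective (an assumption for illustration): linear layer -> layer norm -> MSE.
    return float(np.mean((layer_norm(x @ W) - target) ** 2))

def numerical_grad(f, W, h=1e-6):
    # Central finite differences, one weight at a time.
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += h
        Wm[idx] -= h
        g[idx] = (f(Wp) - f(Wm)) / (2 * h)
    return g

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
target = rng.normal(size=(8, 3))
W = rng.normal(size=(4, 3))
c = 10.0  # rescale the weights by this factor

f = lambda M: loss(M, x, target)
loss_small, loss_big = f(W), f(c * W)
g_small, g_big = numerical_grad(f, W), numerical_grad(f, c * W)

# Gradient norm shrinks by 1/c when the weights grow by c.
grad_ratio = np.linalg.norm(g_small) / np.linalg.norm(g_big)

# Relative update per unit LR: ||grad|| / ||W||. It shrinks by about 1/c**2.
rel_small = np.linalg.norm(g_small) / np.linalg.norm(W)
rel_big = np.linalg.norm(g_big) / np.linalg.norm(c * W)
rel_ratio = rel_small / rel_big

print(loss_small, loss_big, grad_ratio, rel_ratio)
```

On this reading, weights that drift larger during training quietly lower the effective LR, and WD counteracts that by holding the norm down. This is only a sketch for plain gradient steps; Adam's per-parameter normalization changes the exact exponent.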
Yes, that’s part of what I mean about regularization having weird effects and interactions in practice. If it were a Bayesian informative prior, which is the nice theoretical interpretation of penalized regression stuff, you would not expect it to be equivalent to rescaling the LR, or to discover that you had in effect lowered the LR permanently or something, as opposed to it washing out & simply requiring you to spend more data to overcome your poor choice of prior. In a scaling law context, you’d expect it to be a change in the constant, not the exponent or parameterization. (At least, it’s certainly not obvious to me that that’s what WD would be equivalent to, and if AdamW and weight decay worked like one assumed they did, the Hutter group wouldn’t have so many papers about fixing it.)
Has this unimportance of WD as regularization been written up somewhere? As a possible counterpoint, in a recent paper on the grokking phenomenon, the authors found that grokking only occurs when training with WD. Otherwise, once the model reached zero training loss, it would barely have a gradient to follow, and thus would stop building better representations that improve prediction OOD.
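To make the mechanism in that counterpoint concrete, here is a rough sketch of a single AdamW-style step with decoupled weight decay, written from the standard update rule rather than taken from the grokking paper. The function name `adamw_step` and the hyperparameter values are assumptions chosen for illustration. It simulates the "zero training loss" regime by feeding in an all-zero gradient: without WD the weights stay frozen, while with WD they keep shrinking by a factor of `1 - lr * weight_decay` per step.

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: the decay term is added to the update directly,
    # not folded into the gradient (and so not rescaled by sqrt(v_hat)).
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

def run(w0, steps, weight_decay, lr=1e-3):
    # Simulate the post-interpolation regime: the loss gradient is exactly zero.
    w, m, v = w0.copy(), np.zeros_like(w0), np.zeros_like(w0)
    for t in range(1, steps + 1):
        w, m, v = adamw_step(w, np.zeros_like(w), m, v, t,
                             lr=lr, weight_decay=weight_decay)
    return w

w0 = np.array([1.0, -2.0, 0.5])
steps, lr, wd = 1000, 1e-3, 1.0  # illustrative values, not from any paper

w_no_wd = run(w0, steps, weight_decay=0.0, lr=lr)
w_wd = run(w0, steps, weight_decay=wd, lr=lr)

print(np.linalg.norm(w_no_wd), np.linalg.norm(w_wd))
```

This only shows that WD supplies a nonzero update when the loss gradient vanishes. Whether that continued movement is what produces grokking is the paper's claim, not something this toy demonstrates.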