Has this claim, that weight decay (WD) is unimportant as a regularizer, been written up somewhere? As a possible counterpoint, a recent paper on the grokking phenomenon found that grokking only occurs when training with WD. Without it, once the model reaches zero training loss, it barely has any gradient left to follow, and so it stops building the better representations that would improve OOD prediction.
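To make the mechanism concrete, here is a toy sketch, not the setup from the grokking paper. It assumes the loss gradient is exactly zero (in real training it would only be very small) and uses a hypothetical `sgd_step` helper with plain L2-style weight decay. Without WD the weights stop moving; with WD they keep shrinking, so the optimizer still has something to follow.

```python
def sgd_step(w, loss_grad, lr=0.1, weight_decay=0.0):
    """One SGD step with L2-style weight decay added to the loss gradient."""
    return [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, loss_grad)]

# Assumption for illustration: training loss is already zero,
# so the loss contributes no gradient at all.
zero_grad = [0.0, 0.0]

w_plain = [2.0, -3.0]  # trained without weight decay
w_decay = [2.0, -3.0]  # trained with weight decay

for _ in range(100):
    w_plain = sgd_step(w_plain, zero_grad)
    w_decay = sgd_step(w_decay, zero_grad, weight_decay=0.1)

print("no WD:  ", w_plain)  # weights have not moved
print("with WD:", w_decay)  # weights scaled by (1 - lr * wd) each step
```

Whether that continued drift actually leads to better representations is a separate empirical question; the sketch only shows that WD keeps the update non-zero after the training loss has flattened out.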