Not Relevant comments on chinchilla’s wild implications

Not Relevant 3 Aug 2022 21:54 UTC
4 points
Has this claim that WD (weight decay) matters little as regularization been written up anywhere? As a possible counterpoint, in a recent paper on the grokking phenomenon, the authors found that grokking only occurs when training with WD. Otherwise, once the model reaches zero training loss, it has barely any gradient left to follow, and so it stops building the better representations that improve OOD prediction.
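
To make the "no gradient left to follow" point concrete, here is a toy sketch (not the grokking setup from the paper, and every name and number in it is made up for illustration). It fits a two-weight linear model to a single training point whose second feature is zero, so the second weight never receives a loss gradient. Without WD that weight stays wherever it was initialized once the point is fit; with a small WD term it keeps shrinking long after the training loss has flattened out.

```python
# Toy illustration only: a 2-weight linear model, one training point.
# The hyperparameters (lr, weight_decay, init) are arbitrary assumptions.

def train(weight_decay, steps=10_000, lr=0.1):
    x = (1.0, 0.0)   # second feature is zero, so w[1] gets no loss gradient
    y = 1.0
    w = [0.0, 5.0]   # w[1] starts large and is irrelevant to the training loss
    for _ in range(steps):
        err = w[0] * x[0] + w[1] * x[1] - y          # prediction error
        grad = [err * x[0], err * x[1]]              # gradient of 0.5 * err**2
        # Plain gradient descent with an L2 penalty, which for vanilla SGD
        # is the same thing as weight decay.
        w = [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, grad)]
    return w

w_plain = train(weight_decay=0.0)
w_decay = train(weight_decay=0.01)
print("no WD:  ", w_plain)
print("with WD:", w_decay)
```

In the no-WD run the second weight is untouched by training, while in the WD run it decays toward zero. Note that with WD the first weight settles slightly below 1, so the training loss is small but not exactly zero. Whether this kind of continued drift is what drives grokking in real networks is exactly the open question; the sketch only shows that WD keeps supplying an update signal after the loss gradient has vanished.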