As I describe above, I think grokking requires some mechanism to disrupt the shallow patterns so that general patterns can take their place. Explicit regularization such as weight decay does this, but so does the stochasticity of SGD steps.
Grokking still happens with no regularization and full-batch updates, but to a much weaker degree. In that case, I suspect that suboptimal per-parameter update step sizes act as a very weak form of stochasticity. Possibly, perfectly tuning the learning rate for each parameter at each step would prevent grokking.
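To make the comparison concrete, here is a minimal sketch (illustrative only, not the exact experiments discussed above) using the standard modular-addition grokking testbed in PyTorch. The model, hyperparameters, and names like `ModAddMLP` are assumptions for the sake of example; the point is only that toggling weight decay and batch size lets you probe the two mechanisms mentioned.

```python
# Illustrative sketch, not the author's actual setup: a small MLP on modular
# addition (the standard grokking testbed). Varying weight_decay and
# batch_size probes the regularization and minibatch-noise mechanisms.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

P = 97  # modulus for the toy task (a + b) mod P

# Build the full (a, b) -> (a + b) % P dataset and split it in half.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
split = len(pairs) // 2
train_ds = TensorDataset(pairs[perm[:split]], labels[perm[:split]])
test_ds = TensorDataset(pairs[perm[split:]], labels[perm[split:]])

class ModAddMLP(nn.Module):
    def __init__(self, p=P, dim=128):
        super().__init__()
        self.embed = nn.Embedding(p, dim)
        self.net = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, p))

    def forward(self, ab):
        e = self.embed(ab)                      # (batch, 2, dim)
        return self.net(e.flatten(start_dim=1))  # logits over residues

def train(weight_decay=1.0, batch_size=512, epochs=2000):
    """weight_decay=0.0 together with batch_size=len(train_ds) approximates
    the 'no regularization, full-batch updates' condition discussed above."""
    model = ModAddMLP()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=weight_decay)
    loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for ab, y in loader:
            opt.zero_grad()
            loss_fn(model(ab), y).backward()
            opt.step()
        if epoch % 100 == 0:
            with torch.no_grad():
                acc = lambda ds: (model(ds.tensors[0]).argmax(-1)
                                  == ds.tensors[1]).float().mean().item()
                print(f"epoch {epoch}: train {acc(train_ds):.3f}  test {acc(test_ds):.3f}")
    return model
```

Comparing `train(weight_decay=1.0)` against `train(weight_decay=0.0, batch_size=len(train_ds))` shows the kind of gap in question: with weight decay and minibatch noise, test accuracy eventually jumps long after train accuracy saturates, while the unregularized full-batch run closes that gap much more weakly, if at all.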