My speculation for Omni-Grok in particular is that in settings like MNIST you already have two of the ingredients for grokking (that there are both memorising and generalising solutions, and that the generalising solution is more efficient), and then having large parameter norms at initialisation provides the third ingredient (generalising solutions are learned more slowly), for reasons I still don't understand.
Higher weight norm means lower effective learning rate with Adam, no? In that paper they used a constant learning rate across weight norms, but Adam tries to normalize the gradients to be of size 1 per parameter, regardless of the size of the weights. So the weights change more slowly with larger initializations (especially since they constrain the weights to be of fixed norm by projecting after the Adam step).
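For intuition, here's a rough sketch (a toy weight matrix, not the paper's setup, and the fixed-norm projection isn't modelled) showing that the *relative* size of an Adam step shrinks as the initialisation scale grows:

```python
import torch

def relative_update(scale, lr=1e-3):
    """Relative change in a weight matrix after one Adam step,
    starting from an initialisation scaled by `scale`."""
    torch.manual_seed(0)
    w = torch.nn.Parameter(scale * torch.randn(100, 100))
    opt = torch.optim.Adam([w], lr=lr)
    x = torch.randn(32, 100)

    before = w.detach().clone()
    opt.zero_grad()
    loss = (x @ w).pow(2).mean()
    loss.backward()
    opt.step()

    # Adam's per-parameter update is roughly lr in absolute size,
    # so the relative change shrinks as the weight norm grows.
    return ((w.detach() - before).norm() / before.norm()).item()

print(relative_update(scale=0.1))   # larger relative change
print(relative_update(scale=10.0))  # much smaller relative change
```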
Sounds plausible, but why does this differentially impact the generalizing algorithm over the memorizing algorithm?
Perhaps under normal circumstances both are learned so fast that you just don’t notice that one is slower than the other, and this slows both of them down enough that you can see the difference?