Quintin Pope comments on QAPR 5: grokking is maybe not that big a deal?

Quintin Pope 4 Aug 2023 1:44 UTC
LW: 3 AF: 1
I mean (1). You can see as much in the figure displayed in the linked notebook:
Note the lack of decrease in the val loss.
I only train for 3e4 steps because that’s sufficient to reach generalization with implicit regularization. E.g., here’s the loss graph I get if I set the batch size down to 50:
Setting the learning rate to 7e-2 also allows for generalization within 3e4 steps (though not as stably):
The slingshot effect does take longer than 3e4 steps to generalize:
- Eric J. Michaud 10 Aug 2023 19:45 UTC
  1 point
  Huh, those batch size and learning rate experiments are pretty interesting!