Unfortunately, reproducing slingshots reliably was pretty challenging for me; I could consistently get them to happen in transformers with 2+ layers, but not reliably in 1-layer transformers (and not at all in 1-layer MLPs).
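For concreteness, here's a minimal sketch of the kind of setup meant here: a small 2-layer transformer on modular addition, trained full-batch with an adaptive optimizer (slingshots are usually reported with Adam-family optimizers). Every hyperparameter below is an illustrative assumption, not the exact config I used.

```python
# Sketch of a slingshot/grokking-style setup: 2-layer transformer on
# modular addition, full-batch AdamW, long training horizon.
# All sizes and hyperparameters are illustrative guesses.
import torch
import torch.nn as nn

P = 113  # modulus; task is (a, b) -> (a + b) % P
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(perm) // 2], perm[len(perm) // 2 :]

class TinyTransformer(nn.Module):
    def __init__(self, vocab=P, d_model=128, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.pos = nn.Parameter(torch.randn(2, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=512, dropout=0.0, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.unembed = nn.Linear(d_model, vocab)

    def forward(self, x):  # x: (batch, 2) token ids
        h = self.encoder(self.embed(x) + self.pos)
        return self.unembed(h[:, -1])  # predict from the last position

model = TinyTransformer()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()
for step in range(20000):  # full-batch training, long horizon
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            acc = (model(pairs[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
        print(f"{step}: train loss {loss.item():.4f}, test acc {acc.item():.3f}")
```

Slingshots, if they show up, appear as sudden late-training spikes in the train loss trace, often followed by jumps in test accuracy.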
Shallow/wide NNs seem to be bad in a lot of ways. Have you tried instead ‘skinny’ NNs with a bias towards depth, which ought to have inductive biases towards more algorithmic, less memorization-heavy solutions? (Particularly for MLPs, which are notorious for overfitting due to their power.)
Have you tried instead ‘skinny’ NNs with a bias towards depth,
I haven’t; the problem with skinny NNs is that stacking MLP layers quickly makes things uninterpretable, and my attempts to reproduce slingshot → grokking were done with the hope of interpreting the model before/after the slingshots.
That being said, you’re probably right that having more layers is related to slingshots.
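To make the shallow/wide vs. deep/skinny comparison concrete, here's what it might look like at a roughly matched parameter budget (the widths are made-up illustrative numbers, not anyone's actual experiment):

```python
# Shallow/wide vs. deep/skinny MLPs at a roughly matched parameter count.
import torch.nn as nn

def mlp(widths):
    layers = []
    for w_in, w_out in zip(widths[:-1], widths[1:]):
        layers += [nn.Linear(w_in, w_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the trailing ReLU

wide = mlp([128, 2048, 10])             # 1 hidden layer
skinny = mlp([128] + [190] * 8 + [10])  # 8 hidden layers, similar budget

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(wide), count(skinny))  # both land around ~280k parameters
```

The skinny network buys much more compositional depth per parameter, which is the inductive-bias argument; it's also exactly the stacking that makes interpretation harder.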
(Particularly for MLPs, which are notorious for overfitting due to their power.)

What do you mean by power here?

Just a handwavy term for VC dimension, expressivity, number of unique models, or whatever your favorite technical reification of “can be real smart and learn complicated stuff” is.
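One concrete cash-out of “power” in this sense: a wide one-hidden-layer MLP has enough capacity to memorize pure noise, which is exactly the overfitting failure mode being pointed at. A self-contained sketch (sizes arbitrary):

```python
# "Power" as memorization capacity: a wide MLP fits random labels on
# random inputs. Nothing here can generalize; it can only memorize.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 32)          # random inputs
y = torch.randint(0, 10, (512,))  # random labels: no signal to learn
net = nn.Sequential(nn.Linear(32, 4096), nn.ReLU(), nn.Linear(4096, 10))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(net(X), y)
    loss.backward()
    opt.step()
print(loss.item())  # drives toward ~0: the net has memorized the noise
```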