Have you tried instead ‘skinny’ NNs with a bias towards depth? (Particularly for MLPs, which are notorious for overfitting due to their power.)
I haven’t—the problem with skinny NNs is that stacking MLP layers quickly makes things uninterpretable, and my attempts to reproduce slingshot → grokking were done with the hope of interpreting the model before/after the slingshots.
That being said, you’re probably correct that having more layers does seem related to slingshots.
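For concreteness, here is a minimal sketch of the kind of ‘skinny’, depth-biased MLP being discussed, next to a wide shallow one with a roughly similar parameter budget. The widths, depths, and use of PyTorch are illustrative assumptions, not details taken from this exchange.

```python
# A minimal sketch (PyTorch; all widths/depths are illustrative assumptions,
# not values from this thread) of a "skinny" depth-biased MLP versus a wide,
# shallow one with a roughly comparable parameter budget.
import torch.nn as nn


def mlp(widths):
    """Build a plain ReLU MLP from a list of layer widths."""
    layers = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the trailing ReLU


# Skinny: narrow hidden layers, many of them (bias towards depth).
skinny = mlp([128] + [64] * 8 + [10])

# Wide-and-shallow baseline at a roughly similar parameter count.
wide = mlp([128, 256, 10])

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"skinny params: {count(skinny):,}, wide params: {count(wide):,}")
```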
What do you mean by power here?
Just a handwavy term for VC dimension, expressivity, number of unique models, or whatever your favorite technical reification of “can be real smart and learn complicated stuff” is.