Have you tried instead ‘skinny’ NNs with a bias towards depth? (Particularly for MLPs, which are notorious for overfitting due to their power.)
I haven’t—the problem with skinny NNs is that stacking MLP layers quickly makes things uninterpretable, and my attempts to reproduce slingshot → grokking were done with the hope of interpreting the model before/after the slingshots.
That being said, you’re probably correct that having more layers does seem related to slingshots.
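For concreteness, here is a minimal sketch of the kind of ‘skinny’, depth-biased MLP being discussed, next to a wide shallow one with a roughly similar parameter budget. The widths, depths, and use of PyTorch are illustrative assumptions, not details taken from this exchange.

```python
# A minimal sketch (PyTorch; all widths/depths are illustrative assumptions,
# not values from this thread) of a "skinny" depth-biased MLP versus a wide,
# shallow one with a roughly comparable parameter budget.
import torch.nn as nn


def mlp(widths):
    """Build a plain ReLU MLP from a list of layer widths."""
    layers = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the trailing ReLU


# Skinny: narrow hidden layers, many of them (bias towards depth).
skinny = mlp([128] + [64] * 8 + [10])

# Wide-and-shallow baseline at a roughly similar parameter count.
wide = mlp([128, 256, 10])

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"skinny params: {count(skinny):,}, wide params: {count(wide):,}")
```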
What do you mean by power here?
Just a handwavy term for VC dimension, expressivity, number of unique models, or whatever your favorite technical reification of “can be real smart and learn complicated stuff” is.