Unfortunately, reproducing slingshots reliably was pretty challenging for me; I could consistently get them to happen in transformers with 2+ layers, but not reliably in 1-layer transformers (and not at all in 1-layer MLPs).
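For concreteness, here's a minimal sketch of the kind of setup meant here: a small 2-layer transformer on modular addition, trained full-batch with an adaptive optimizer (slingshots are usually reported with Adam-family optimizers). Every hyperparameter below is an illustrative assumption, not the exact config I used.

```python
# Sketch of a slingshot/grokking-style setup: 2-layer transformer on
# modular addition, full-batch AdamW, long training horizon.
# All sizes and hyperparameters are illustrative guesses.
import torch
import torch.nn as nn

P = 113  # modulus; task is (a, b) -> (a + b) % P
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(perm) // 2], perm[len(perm) // 2 :]

class TinyTransformer(nn.Module):
    def __init__(self, vocab=P, d_model=128, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.pos = nn.Parameter(torch.randn(2, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=512, dropout=0.0, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.unembed = nn.Linear(d_model, vocab)

    def forward(self, x):  # x: (batch, 2) token ids
        h = self.encoder(self.embed(x) + self.pos)
        return self.unembed(h[:, -1])  # predict from the last position

model = TinyTransformer()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()
for step in range(20000):  # full-batch training, long horizon
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            acc = (model(pairs[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
        print(f"{step}: train loss {loss.item():.4f}, test acc {acc.item():.3f}")
```

Slingshots, if they show up, appear as sudden late-training spikes in the train loss trace, often followed by jumps in test accuracy.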
Shallow/wide NNs seem to be bad in a lot of ways. Have you tried instead ‘skinny’ NNs with a bias towards depth, which ought to have inductive biases towards more algorithmic, less memorization-heavy solutions? (Particularly for MLPs, which are notorious for overfitting due to their power.)
Have you tried instead ‘skinny’ NNs with a bias towards depth,
I haven’t; the problem with skinny NNs is that stacking MLP layers quickly makes things uninterpretable, and my attempts to reproduce slingshot → grokking were done with the hope of interpreting the model before/after the slingshots.
That being said, you’re probably right that having more layers is related to slingshots.
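To make the shallow/wide vs. deep/skinny comparison concrete, here's what it might look like at a roughly matched parameter budget (the widths are made-up illustrative numbers, not anyone's actual experiment):

```python
# Shallow/wide vs. deep/skinny MLPs at a roughly matched parameter count.
import torch.nn as nn

def mlp(widths):
    layers = []
    for w_in, w_out in zip(widths[:-1], widths[1:]):
        layers += [nn.Linear(w_in, w_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the trailing ReLU

wide = mlp([128, 2048, 10])             # 1 hidden layer
skinny = mlp([128] + [190] * 8 + [10])  # 8 hidden layers, similar budget

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(wide), count(skinny))  # both land around ~280k parameters
```

The skinny network buys much more compositional depth per parameter, which is the inductive-bias argument; it's also exactly the stacking that makes interpretation harder.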
(Particularly for MLPs, which are notorious for overfitting due to their power.)

What do you mean by power here?

Just a handwavy term for VC dimension, expressivity, number of unique models, or whatever your favorite technical reification of “can be real smart and learn complicated stuff” is.
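One concrete cash-out of “power” in this sense: a wide one-hidden-layer MLP has enough capacity to memorize pure noise, which is exactly the overfitting failure mode being pointed at. A self-contained sketch (sizes arbitrary):

```python
# "Power" as memorization capacity: a wide MLP fits random labels on
# random inputs. Nothing here can generalize; it can only memorize.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 32)          # random inputs
y = torch.randint(0, 10, (512,))  # random labels: no signal to learn
net = nn.Sequential(nn.Linear(32, 4096), nn.ReLU(), nn.Linear(4096, 10))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(net(X), y)
    loss.backward()
    opt.step()
print(loss.item())  # drives toward ~0: the net has memorized the noise
```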