FWIW, I don’t think the grokking work actually provides a mechanism; the specific setups where you get grokking/double descent are materially different from the setup of, say, LLM training. Instead, I think grokking and double descent hint at something more fundamental about how learning works—that there are often “simple”, generalizing solutions in parameter space, but that these solutions require many components of the network to align. Both explicit regularization like weight decay or dropout and implicit regularization like slingshots or SGD favor these solutions given enough data or training time.
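As a toy illustration of the regularization point (purely my own sketch, not from any of the papers below; the numbers are made up): if two parameter settings both drive training loss to zero but one has a much smaller norm, weight decay breaks the tie in favor of the low-norm one, so with enough optimization pressure the network drifts toward it.

```python
import numpy as np

# Two hypothetical parameter vectors that both achieve zero training loss.
w_memorizing = np.array([5.0, -4.0, 6.0])    # high-norm solution (think: memorization)
w_generalizing = np.array([0.5, -0.4, 0.6])  # low-norm solution (think: the general circuit)

weight_decay = 1e-2

def regularized_loss(train_loss, w, wd=weight_decay):
    # L2 regularization adds wd * ||w||^2 to the objective.
    return train_loss + wd * np.sum(w ** 2)

# Same (zero) training loss, but weight decay strongly prefers the low-norm solution.
print(regularized_loss(0.0, w_memorizing))    # 0.77
print(regularized_loss(0.0, w_generalizing))  # 0.0077
```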
Don’t have time to write up my thoughts in more detail, but here are some other resources you might be interested in:
Besides Neel Nanda’s grokking work (the most recent version of which seems to be on OpenReview here: https://openreview.net/forum?id=9XFSbDPmdW ), here are a few other relevant recent papers:
Omnigrok: Grokking Beyond Algorithmic Data: Provides significant evidence that grokking happens because generalizing solutions (on the algorithmic tasks + MNIST) have much smaller weight norm (which is favored by regularization), while standard initializations make the high-weight-norm solutions easier to find first. The main evidence here is that if you constrain the weight norm of the network sufficiently, you often get immediate generalization on tasks that normally exhibit grokking.
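If you want to poke at the weight-norm intervention yourself, the basic move is just to rescale the parameters back to a fixed overall norm after every optimizer step. A minimal PyTorch sketch of that idea (my paraphrase, not the paper's code; the target norm is a placeholder you'd tune):

```python
import torch

def constrain_weight_norm(model: torch.nn.Module, target_norm: float) -> None:
    """Rescale all trainable parameters so their overall L2 norm equals target_norm."""
    with torch.no_grad():
        params = [p for p in model.parameters() if p.requires_grad]
        total_norm = torch.sqrt(sum((p ** 2).sum() for p in params))
        scale = target_norm / (total_norm + 1e-12)
        for p in params:
            p.mul_(scale)

# In an otherwise-standard training loop, call this right after optimizer.step(), e.g.:
#   constrain_weight_norm(model, target_norm=10.0)   # target_norm=10.0 is a made-up value
```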
Unifying Grokking and Double Descent: (updated preprint here) Makes an explicit connection between Double Descent + Grokking, with the following uncontroversial claim (which ~everyone in the space believes):
> Claim 1 (Pattern learning dynamics). Grokking, like epoch-wise double descent, occurs when slow patterns generalize well and are ultimately favored by the training regime, but are preceded by faster patterns which generalize poorly.
And this slightly more controversial claim:
> Claim 2 (Pattern learning as function of EMC). In both grokking and double descent, pattern learning occurs as a function of effective model complexity (EMC) (Nakkiran et al., 2021), a measure of the complexity of a model that integrates model size and training time.
They create a toy two-feature model to explain this in the appendix of the updated preprint.
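Since EMC is doing the work in Claim 2, it's worth spelling out: Nakkiran et al. define it roughly as the largest training-set size at which the training procedure still reaches ~zero training error. Here's a quick sketch of estimating it for a toy procedure (polynomial least-squares; the function names, thresholds, and degree are my own choices, not anything from these papers):

```python
import numpy as np

def estimate_emc(train_and_measure, sample_sizes, eps=1e-6, n_trials=5):
    """Crude estimate of Effective Model Complexity (Nakkiran et al., 2021):
    the largest n for which the training procedure still reaches ~zero train error."""
    emc = 0
    for n in sample_sizes:
        mean_err = np.mean([train_and_measure(n) for _ in range(n_trials)])
        if mean_err <= eps:
            emc = n
    return emc

DEGREE = 5  # a degree-5 polynomial can interpolate at most 6 generic points

def poly_train_error(n, degree=DEGREE):
    """Fit a degree-`degree` polynomial to n noisy points; return the training MSE."""
    rng = np.random.default_rng()
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3 * x) + 0.1 * rng.normal(size=n)
    coeffs = np.polyfit(x, y, deg=degree)   # (np.polyfit warns about conditioning for tiny n; fine here)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# The EMC of this procedure comes out around DEGREE + 1 = 6: past that, the fit can no
# longer drive training error to ~zero, which is where double-descent-style behavior kicks in.
print(estimate_emc(poly_train_error, sample_sizes=range(2, 20)))
```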
Multi-Component Learning and S-Curves: Creates a toy model of emergence: when the optimal solution consists of the product of several pieces, we’ll often see the same S-shaped loss curves that we see in practice.
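The core toy model there is simple enough to play with in a few lines: if the output is a product of several components that all start small, each component's gradient is proportional to the product of the others, so the loss sits on a plateau and then drops sharply once the components grow together. A rough sketch of that dynamic (my own re-derivation with made-up hyperparameters, not the post's code):

```python
import numpy as np

# Toy model: the output is a product of k components, the target is 1.0, and all
# components start small. Train with plain gradient descent on squared error.
k = 5
lr = 0.05
w = np.full(k, 0.3)   # small symmetric initialization
target = 1.0

losses = []
for step in range(300):
    out = np.prod(w)
    losses.append((out - target) ** 2)
    # d(loss)/d(w_i) = 2 * (out - target) * prod_{j != i} w_j
    grads = 2 * (out - target) * out / w   # fine while every w_i stays nonzero
    w -= lr * grads

# The loss barely moves for roughly the first hundred steps, then collapses in a
# sharp S-shaped transition as all the components grow together.
print([round(v, 3) for v in (losses[0], losses[50], losses[100], losses[150], losses[200])])
```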
The code to reproduce all four of these papers is available if you want to play around with them more.
You might also be interested in examples of emergence in the LLM literature, e.g. https://arxiv.org/abs/2202.07785 or https://arxiv.org/abs/2206.07682 .
I also think you might find the other variants of the “optimizer is simpler than memorizer” story for mesa-optimization on LW/AF interesting (though ~all of these predate even Neel’s grokking work?).