However, these results seem explainable by the widely-observed tendency of larger models to learn faster and generalize better, given equal optimization steps.
This seems right and I don’t think we say anything contradicting it in the paper.
I also don’t see how saying ‘different patterns are learned at different speeds’ is supposed to have any explanatory power. It doesn’t explain why some types of patterns are faster to learn than others, or what determines the relative learnability of memorizing versus generalizing patterns across domains. It feels like saying ‘bricks fall because it’s in a brick’s nature to move towards the ground’: both are repackaging an observation as an explanation.
The idea is that the ‘learning at different speeds’ framing lets you treat grokking and double descent as the same phenomenon. It’s more like generalizing ‘bricks move towards the ground’ and ‘rocks move towards the ground’ to ‘objects move towards the ground’. I don’t think we make any grand claims about explaining everything in the paper, but I’ll have a look and see if there are edits I should make. Thanks for raising these points.
Broadly agree with the takes here.