ESRogs comments on Mesa-Optimizers via Grokking

ESRogs 7 Dec 2022 5:20 UTC
4 points
After reading through the Unifying Grokking and Double Descent paper that LawrenceC linked, it sounds like I’m mostly saying the same thing as what’s in the paper.
(Not too surprising, since I had just read Lawrence’s comment, which summarizes the paper, when I made mine.)
In particular, the paper describes Type 1, Type 2, and Type 3 patterns, which correspond to my easy-to-discover patterns, memorizations, and hard-to-discover patterns:
“In our model of grokking and double descent, there are three types of patterns learned at different speeds. Type 1 patterns are fast and generalize well (heuristics). Type 2 patterns are fast, though slower than Type 1, and generalize poorly (overfitting). Type 3 patterns are slow and generalize well.”
The one thing I mention above that I don’t see in the paper is an explanation for why Type 2 patterns would be intermediate in learnability between Type 1 and Type 3 patterns, or why there would be a regime where they dominate (resulting in overfitting).
My proposed explanation is that, for any given task, the exact mappings from input to output (the memorizations) will tend to have a characteristic complexity, which means that they will have a relatively narrow distribution of learnability. That’s why models will often hit a regime where they’re mostly finding those patterns, rather than Type 1, easy-to-learn heuristics (which they’ve exhausted) or Type 3, hard-to-learn rules (which they’re not discovering yet).
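To make that concrete, here is a toy sketch in Python. It is not from the paper, and every number in it is made up: I assume each pattern has a single "learning time", that heuristics and rules are spread widely over early and late times, and that memorizations are packed into a narrow band in between. The names `spread`, `patterns`, and `dominant_type_per_window` are hypothetical helpers for illustration only.

```python
from collections import Counter


def spread(center, width, n):
    """Return n evenly spaced learning times in [center - width, center + width]."""
    if n == 1:
        return [center]
    step = 2 * width / (n - 1)
    return [center - width + i * step for i in range(n)]


# Hypothetical learning times in arbitrary units (think: log training steps).
# Assumption: memorizations share a characteristic complexity, so their
# learning times sit in a narrow band; the other two types are spread out.
patterns = (
    [("type1_heuristic", t) for t in spread(center=2.0, width=2.0, n=20)]
    + [("type2_memorization", t) for t in spread(center=5.0, width=0.5, n=60)]
    + [("type3_rule", t) for t in spread(center=9.0, width=2.0, n=20)]
)


def dominant_type_per_window(patterns, window=1.0, horizon=12):
    """For each time window, report which pattern type is learned most often.

    Returns a list of (window_start, type_name) pairs; type_name is None
    when no pattern has its learning time inside that window.
    """
    result = []
    for k in range(int(horizon / window)):
        start, end = k * window, (k + 1) * window
        counts = Counter(kind for kind, t in patterns if start <= t < end)
        top = counts.most_common(1)[0][0] if counts else None
        result.append((start, top))
    return result


if __name__ == "__main__":
    for start, kind in dominant_type_per_window(patterns):
        print(f"t in [{start:.0f}, {start + 1:.0f}): {kind}")
```

Under these assumptions, the early windows are dominated by heuristics, a short middle stretch is dominated almost entirely by memorizations, and the late windows by slow rules. The point is only that a narrow band of learnability is enough to produce a distinct memorization-dominated regime. It says nothing about whether real networks behave this way.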
The authors do have an appendix section A.1 in the paper with the heading, “Heuristics, Memorization, and Slow Well-Generalizing”, but with “[TODO]”s in the text. I’ll be curious to see whether they end up saying something similar to this point (about input-output memorizations tending to have a characteristic complexity) there.