Some potentially naive thoughts/questions:
At a cursory level, this seems closely related to Deep Double Descent, but you don’t mention it, which I find surprising; did I pattern-match in error?
This also seems tangentially related to the single basin hypothesis
Idk, it might be related to double descent? I’m not that convinced.
Firstly, IMO, the most interesting parts of deep double descent are the model-size-wise and data-wise descents, which totally don’t apply here.
They did also find epoch-wise double descent (different from data-wise, because the model is trained on the same data many times over), which is more related, but that looks like test loss going down, then up again, then down. You could argue that grokking also has test loss going up, but since it starts at uniform test loss I don’t think that counts.
My guess is that the descent part of deep double descent reflects some underlying competition between different circuits in the model, where some do memorisation and others do generalisation, and that grokking involves a similar competition and switching. Which is cool and interesting and somewhat related! And it totally wouldn’t surprise me if there’s a similar tension between hard-to-reach-but-simple and easy-to-reach-but-complex solutions.
Re single basin: idk, I actually think this is a clear disproof of the single basin hypothesis (in this specific case; it could still easily be mostly true for other problems). Here, there’s a separate solution to modular addition for each of the 56 frequencies! These solutions are analogous, but they are fairly far apart in model space, and definitely can’t be bridged just by permuting the weights. (E.g., an embedding picking up on cos(5x) vs cos(18x) is a totally different solution with a totally different set of weights, and mapping one onto the other would require significant non-linear shuffling around of parameters.)
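To make the permutation point concrete, here’s a minimal numpy sketch (assuming the modulus is p = 113, which is what “56 frequencies” suggests via (p - 1)/2 = 56; the 4-dimensional embedding is purely illustrative). Distinct cosine frequencies are orthogonal directions over the residues, and the weight-permutation symmetry only reorders hidden dimensions, so it can never map an embedding built on cos(5x) onto one built on cos(18x):

```python
import numpy as np

# Sketch only: p = 113 is inferred from the "56 frequencies" ((113 - 1) / 2 = 56),
# and the 4-dim embedding below is hypothetical.
p = 113
x = np.arange(p)

# Token-space directions an embedding would pick up for two different frequencies.
f5 = np.cos(2 * np.pi * 5 * x / p)
f18 = np.cos(2 * np.pi * 18 * x / p)

# Fourier orthogonality: distinct frequencies are orthogonal over the residues mod p.
print(np.dot(f5, f18))  # ~0 (up to float error)
print(np.dot(f5, f5))   # ~p/2, so f5 itself has non-trivial norm

# The permutation symmetry acts on hidden dimensions, i.e. on the columns of the
# embedding W_E (shape [p, d_model]); it reorders columns but cannot change which
# token-space direction (frequency) those columns lie along.
rng = np.random.default_rng(0)
W_freq5 = np.outer(f5, rng.normal(size=4))  # toy embedding using only frequency 5
perm = rng.permutation(4)
W_perm = W_freq5[:, perm]                   # permute the hidden dimensions

# The column space is still span(f5): projecting f18 onto it gives ~0, so no
# permutation of hidden units turns the frequency-5 solution into a frequency-18 one.
proj_f18 = W_perm @ np.linalg.pinv(W_perm) @ f18
print(np.linalg.norm(proj_f18))  # ~0
```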
You can observe “double descent” of test loss curves in the grokking setting, and there is “grokking” of test set performance as model dimension is increased, as this paper points out.