Idk, it might be related to double descent? I’m not that convinced.
Firstly, IMO, the most interesting part of deep double descent is the model-size-wise/data-wise descent, neither of which applies here.
They did also find epoch-wise double descent (different from data-wise, because the model is trained on the same data repeatedly), which is more related, but that looks like test loss going down, then going up again, then going down. You could argue that grokking also has test loss going up, but since it starts at uniform test loss I don't think this counts.
My guess is that the descent part of deep double descent illustrates some underlying competition between different circuits in the model, where some do memorisation and others do generalisation, and that grokking involves some similar competition and switching. Which is cool and interesting and somewhat related! And it totally wouldn't surprise me if there's some similar tension between hard-to-reach-but-simple and easy-to-reach-but-complex solutions.
Re single basin, idk, I actually think it's a clear disproof of the single basin hypothesis (in this specific case; it could still easily be mostly true for other problems). Here, there's a solution to modular addition for each of the 56 frequencies! These solutions are analogous, but they are fairly far apart in model space, and definitely can't be bridged by just permuting the weights (e.g., the embedding picking up on cos(5x) vs cos(18x) is a totally different solution and set of weights, and would require significant non-linear shuffling around of parameters).
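To make the permutation point concrete, here's a minimal sketch (not from the original discussion; p = 113 is inferred from the 56 = (p-1)/2 frequencies, and looking at a single embedding column per solution is an illustrative simplification). Permuting hidden neurons only reorders embedding columns, it never changes the values within a column, and the cos(5x) and cos(18x) columns are orthogonal as vectors over the residues, so no permutation maps one solution onto the other.

```python
import numpy as np

# Illustrative assumptions: p = 113 (so there are (p-1)/2 = 56 frequencies),
# and each "solution" is represented by one embedding column.
p = 113
x = np.arange(p)

# One embedding direction per hypothetical solution: frequency 5 vs frequency 18.
e5 = np.cos(2 * np.pi * 5 * x / p)
e18 = np.cos(2 * np.pi * 18 * x / p)

# Permuting hidden neurons only reorders *columns* of the embedding matrix;
# it can't change the values within a column. So if the two solutions were
# permutation-equivalent, the cos(5x) column would have to match some column
# of the cos(18x) solution. Instead, the two directions are orthogonal:
cos_sim = e5 @ e18 / (np.linalg.norm(e5) * np.linalg.norm(e18))
print(f"cosine similarity between cos(5x) and cos(18x) columns: {cos_sim:.3f}")
# ~0.000: distinct Fourier frequencies are orthogonal over the residues mod p,
# so the two solutions point in genuinely different directions in weight space.
```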