But if you keep training, GD should eventually find a low-complexity solution that scores well on test data, if one exists, because those solutions have an even higher score under the full objective (with an appropriate regularization term). Obviously much depends on the degree of overparameterization and the relative strength of the reg term: if it's too strong, GD may fail, or at least appear to fail, because it skips the easier high-complexity-solution stage. I thought that explanation of grokking was pretty clear.
I think I’m still not understanding. Shouldn’t the implicit regularization strength of SGD be higher, not lower, for fewer iterations? So running it longer should give you a higher-complexity, not a lower-complexity solution. (Although it’s less clear how this intuition pans out once you already have very low training loss, maybe you’re saying that double descent somehow kicks in there?)
I think grokking requires explicit mild regularization (or at least, it’s easier to model how that leads to grokking).
The total objective is training loss + reg term. Initially the training loss totally dominates, and GD pushes that down until it overfits (finding a solution with near 0 training loss balanced against reg penalty). Then GD bounces around on that near 0 training loss surface for a while, trying to also reduce the reg term without increasing the training loss. That’s hard to do, but eventually it can find rare solutions that actually generalize (still allow near 0 training loss at much lower complexity). Those solutions are like narrow holes in that surface.
You can run it as long as you want, but it's never going to ascend into higher-complexity regions than those which enable 0 training loss (model entropy on the order of the data-set entropy); the reg term should ensure that.
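The dynamics described above can be illustrated with a minimal sketch: gradient descent on an objective of the form training loss + reg term, here MSE plus an L2 penalty on an overparameterized toy linear problem. This is my own hypothetical setup (the dimensions, learning rate, and `lam` values are illustrative), not anything from the grokking experiments; the point is just that the penalty steers GD toward low-norm interpolating solutions while the unregularized run keeps whatever extra complexity its initialization had.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy overparameterized problem: 5 data points, 20 parameters,
# generated by a sparse (low-complexity) true solution.
n, d = 5, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:2] = [1.0, -1.0]
y = X @ w_true

def train(lam, steps=20000, lr=0.01):
    """Gradient descent on: (1/n)||Xw - y||^2 + lam * ||w||^2."""
    w = rng.normal(size=d)  # random init carries null-space "complexity"
    for _ in range(steps):
        resid = X @ w - y
        grad = (2.0 / n) * X.T @ resid + 2.0 * lam * w
        w -= lr * grad
    train_loss = float(np.mean((X @ w - y) ** 2))
    return w, train_loss

# Without the reg term, GD interpolates but leaves the null-space
# component of the random init untouched (higher-norm solution).
w_noreg, loss_noreg = train(lam=0.0)
# With a mild reg term, GD still reaches near-zero training loss but
# slowly shrinks the null-space component toward a lower-norm solution.
w_reg, loss_reg = train(lam=1e-2)

print(f"no reg:   loss={loss_noreg:.2e}  ||w||={np.linalg.norm(w_noreg):.3f}")
print(f"with reg: loss={loss_reg:.2e}  ||w||={np.linalg.norm(w_reg):.3f}")
```

Both runs sit on (or very near) the zero-training-loss surface at the end; the regularized run ends with a visibly smaller weight norm, which is the "bouncing around the surface while reducing the reg term" phase in miniature.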
Okay I think I get what you’re saying now—more SGD steps should increase “effective model capacity”, so per the double descent intuition we should expect the validation loss to first increase then decrease (as is indeed observed). Is that right?