I’m not sure I get what the relation would be—double descent is usually with respect to the model size (vs. amount of data), although there is some work on double descent vs. number of training iterations e.g. https://arxiv.org/abs/1912.02292. But I don’t immediately see how to connect this to grokking.
(I agree they might be connected, I’m just saying I don’t see how to show this. I’m very interested in models that can explain grokking, so if you have ideas let me know!)
(That arxiv link isn’t working btw.)
It makes sense that GD will first find high complexity overfit solutions for an overcomplete model—they are most of the high training-scoring solution space. But if you keep training, GD should eventually find a low complexity high test scoring solution—if one exists—because those solutions have an even higher score (with some appropriate regularization term). Obviously much depends on the overparameterization and relative reg term strength—if it’s too strong GD may fail, or at least appear to fail, as it skips the easier high complexity solution stage. I thought that explanation of grokking was pretty clear.
I was also under the impression that double descent is basically the same thing, but viewed from the model complexity dimension. Initially, in the under-parameterized regime, validation error decreases with model complexity up to a saturation point just below where the model can start to memorize/overfit, then increases up to a second, worse overfitting saturation point, then eventually starts to decrease again heading into the strongly overparameterized regime (assuming appropriate mild regularization).
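If it helps, here’s a minimal numpy sketch of that test-error-vs-model-size curve (my own toy, not anything from the thread or a paper): random ReLU features fit by minimum-norm least squares, so the “mild regularization” is the implicit norm minimization rather than an explicit penalty, and the widths, noise level, and sample sizes are arbitrary choices.

```python
# Toy double-descent sweep: random ReLU features + minimum-norm least squares.
# Test MSE typically peaks near the interpolation threshold (width ~ n_train)
# and then falls again in the overparameterized regime.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 30, 500, 10

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ np.ones(d) + 0.3 * rng.normal(size=n)   # noisy linear target
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

for width in [5, 10, 20, 29, 30, 31, 40, 100, 1000]:
    W = rng.normal(size=(d, width)) / np.sqrt(d)    # fixed random features
    phi_tr = np.maximum(X_tr @ W, 0)
    phi_te = np.maximum(X_te @ W, 0)
    beta = np.linalg.pinv(phi_tr) @ y_tr            # minimum-norm fit
    mse = np.mean((phi_te @ beta - y_te) ** 2)
    print(f"width={width:5d}  test MSE={mse:.3f}")
```

In runs of this kind I’d expect the worst test MSE somewhere around width ≈ n_train, which would be the second “saturation point” you describe, with the error coming back down as width grows past it.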
In the strongly overparameterized regime two things are happening: first, the extra capacity lets the model more easily represent a distribution of solutions rather than a single solution, and second, it effectively speeds up learning in proportion by evaluating more potential solutions (lottery tickets) per step; see the rough check sketched after this comment. Grokking can then occur, as it requires sufficient overparameterization (whereas in the underparameterized regime there isn’t enough capacity to simultaneously represent a sufficient distribution of solutions to smoothly interpolate and avoid getting stuck in local minima).
Looking at it another way: increased model complexity has a strong upside that scales nearly unboundedly, coupled with the single downside of overfitting, which saturates at around the data-memorization complexity.
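As a rough check of the “more lottery tickets evaluated per step” point, here’s a small sketch (my own toy, assuming PyTorch; the task, widths, and threshold are arbitrary): train MLPs of increasing width to memorize the same random-label dataset and count how many optimizer steps each needs to push training loss below a threshold.

```python
# Do wider nets reach near-zero training loss in fewer steps on a pure
# memorization task? Train the same random-label problem at several widths.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 20)
y = torch.randint(0, 2, (256,))          # random labels: memorization only

for width in [16, 64, 256, 1024]:
    model = nn.Sequential(nn.Linear(20, width), nn.ReLU(), nn.Linear(width, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for step in range(1, 20001):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
        if loss.item() < 0.01:
            print(f"width {width:5d}: training loss < 0.01 after {step} steps")
            break
    else:
        print(f"width {width:5d}: never got below 0.01 in {step} steps")
```

If the intuition is right, the step count should drop as width grows (and the narrowest net may not interpolate at all, which is the capacity point about the underparameterized regime).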
I think I’m still not understanding. Shouldn’t the implicit regularization strength of SGD be higher, not lower, for fewer iterations? So running it longer should give you a higher-complexity, not a lower-complexity solution. (Although it’s less clear how this intuition pans out once you already have very low training loss, maybe you’re saying that double descent somehow kicks in there?)
I think grokking requires explicit mild regularization (or at least, it’s easier to model how that leads to grokking).
The total objective is training loss + reg term. Initially the training loss totally dominates, and GD pushes that down until it overfits (finding a solution with near 0 training loss balanced against the reg penalty). Then GD bounces around on that near 0 training loss surface for a while, trying to also reduce the reg term without increasing the training loss. That’s hard to do, but eventually it can find rare solutions that actually generalize (still allowing near 0 training loss at much lower complexity). Those solutions are like narrow holes in that surface.
You can run it as long as you want, but it’s never going to ascend into higher complexity regions than those which enable 0 training loss (model entropy on the order of the dataset entropy); the reg term should ensure that.
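To make the “bounce around on the near-0 training loss surface while the reg term slowly shrinks” picture concrete, here’s a rough sketch of the kind of experiment where you’d look for it (my own toy, assuming PyTorch; the modular-addition task is in the spirit of the original grokking setups, but the one-hidden-layer architecture, the weight decay value, and the step counts are guesses and may well need tuning for the delayed test-accuracy jump to actually show up):

```python
# Toy grokking-style run: small MLP on modular addition, trained on half of all
# (a, b) pairs with weight decay (the stand-in for the reg term above).
# Expected pattern if grokking occurs: train accuracy hits ~1.00 early, the weight
# norm then drifts down for a long time, and test accuracy jumps up much later.
import torch
import torch.nn as nn

torch.manual_seed(0)
P = 97                                                    # modulus
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
split = len(pairs) // 2
train_idx, test_idx = perm[:split], perm[split:]

def encode(idx):
    a, b = pairs[idx, 0], pairs[idx, 1]
    x = torch.zeros(len(idx), 2 * P)
    x[torch.arange(len(idx)), a] = 1.0                    # one-hot first operand
    x[torch.arange(len(idx)), P + b] = 1.0                # one-hot second operand
    return x, labels[idx]

X_tr, y_tr = encode(train_idx)
X_te, y_te = encode(test_idx)

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(50000):
    opt.zero_grad()
    loss_fn(model(X_tr), y_tr).backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            tr_acc = (model(X_tr).argmax(1) == y_tr).float().mean()
            te_acc = (model(X_te).argmax(1) == y_te).float().mean()
            wnorm = sum(p.norm() ** 2 for p in model.parameters()).sqrt()
        print(f"step {step:6d}  train acc {tr_acc:.2f}  test acc {te_acc:.2f}  ||w|| {wnorm:.1f}")
```

The weight-norm column is the part that speaks to your mechanism: if the story above is right, the long flat stretch between memorization and generalization should coincide with the norm slowly decreasing while train accuracy stays pinned at ~1.00.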
Okay I think I get what you’re saying now—more SGD steps should increase “effective model capacity”, so per the double descent intuition we should expect the validation loss to first increase then decrease (as is indeed observed). Is that right?