I’m not sure I understand.
I chose the grokking starting point as 300 steps, based on the yellow plot. I’d say ‘grokking is complete’ by the 2000 step mark in the default setting, whereas it is complete by the 450 step mark in the 10x setting (assuming appropriate LR decay to avoid overshooting). Also note that the plots in the report are not log-scale.
Ah, I just looked at your plots, verified that the grokking indeed still happened with 5x and 10x learning rates, and then just assumed 10x faster convergence in the original plots in the post. Apparently that reasoning was wrong. Presumably you’re using different hyperparameters than the ones used in this post? You seem to have faster grokking in the “default setting” than in the plots shown in the post.
(And it does look like, given some default setting, “10x faster convergence” is basically right, since in your case 10x higher LR makes the grokking stage go from 1700 steps to 150 steps.)
(Partly the issue was that I wasn’t sure whether the x-axis in your plots was starting from the beginning of training, or from the point that grokking started, so I instead reasoned about the impact on the graphs in this post. Though looking at the LR plot it’s now obvious that it’s from the beginning of training.)
I now think this is relatively strong evidence for my view, given that grokking happens pretty quickly (~a third of total training), though it probably is still decently slower than the memorization. (Do you happen to have the training loss curves, so we can estimate how long it takes to memorize under your hyperparameters?)
First, I’d like to note that I don’t see why faster convergence after changing the learning rate supports either story. After initial memorization, the loss decreases by ~3 OOM. Regardless of what’s going on inside the network, it wouldn’t be surprising if raising the learning rate sped up convergence.
Also, I think what’s actually going on here is weirder than either of our interpretations. I ran experiments where I kept the learning rate the same for the first 1000 steps, then increased it by 10x and 50x for the rest of the training.
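For concreteness, the schedule described above can be sketched as a step function on the learning rate multiplier. This is only an illustration: the name `lr_multiplier` and the value of `BASE_LR` are assumptions, not taken from the actual experiment code.

```python
def lr_multiplier(step: int, switch_step: int = 1000, factor: float = 10.0) -> float:
    """Multiplier on the base learning rate: 1.0 before switch_step, factor from then on."""
    return 1.0 if step < switch_step else factor


BASE_LR = 1e-3  # hypothetical value, not taken from the experiments

# Sample the schedule around the switch point for the 10x and 50x variants.
for factor in (10.0, 50.0):
    sampled = [BASE_LR * lr_multiplier(s, factor=factor) for s in (0, 999, 1000, 5000)]
    print(f"{factor:g}x:", sampled)
```

A function of this shape (step in, multiplier out) is what PyTorch’s `torch.optim.lr_scheduler.LambdaLR` takes as its `lr_lambda` argument, so it could be plugged into a training loop that way.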
Here is the accuracy curve with the default learning rate:
Here is the curve with 10x learning rate:
And here is the curve with 50x learning rate:
Note that increasing the learning rate doesn’t consistently increase validation convergence. The 50x run does reach convergence faster, but the 10x run doesn’t even reach it at all.
In fact, increasing the learning rate causes the training accuracy to fall to the validation accuracy, after which they begin to increase together (at least for a while). For the 10x increase, the training accuracy quickly diverges from the validation accuracy. In the 50x run, the training and validation accuracies move in tandem throughout the run.
Frederik’s results are broadly similar. If you mouse over the accuracy and loss graphs, you’ll see that:
Training performance drops significantly immediately after the learning rate increases.
The losses and accuracies of the “5x” and “10x” lines correlate together pretty well between training/validation. In contrast, the losses and accuracies of the “default” lines don’t correlate strongly between training and testing.
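One way to make “correlate pretty well” checkable is to compute a Pearson correlation between the logged training and validation curves of each run. The sketch below uses made-up accuracy values purely for illustration; they are not read off any of the plots.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (undefined if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up curves: a run where train and val move together ...
coupled_train = [0.30, 0.45, 0.60, 0.80, 0.95]
coupled_val = [0.28, 0.44, 0.58, 0.79, 0.93]
# ... and a run where train saturates long before val moves.
decoupled_train = [0.90, 1.00, 0.99, 1.00, 0.99]
decoupled_val = [0.05, 0.06, 0.10, 0.30, 0.90]

print("coupled:", pearson(coupled_train, coupled_val))
print("decoupled:", pearson(decoupled_train, decoupled_val))
```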
I think that increasing the learning rate after memorization causes some sort of “mode shift” in the training process. It goes from:
First, learn shallow patterns that strongly overfit to the training data, then learn general patterns.
to:
Immediately learn general patterns that perform about equally well on the training and validation data.
In the case of my 10x run, I think it actually has two mode transitions, first from “shallow first” to “immediately general”, then another transition back to “shallow first”, and that’s why you see the training accuracy diverge from the validation accuracy again.
I think results like these make a certain amount of sense, given that higher learning rates are associated with better generalization in more standard settings.
I’m kinda confused at your perspective on learning rates. I usually think of learning rates as being set to the maximum possible value such that training is still stable. So it would in fact be surprising if you could just 10x them to speed up convergence. (So an additional aspect of my prediction would be that you can’t 10x the learning rate at the beginning of training; if you could then it seems like the hyperparameters were chosen poorly and that should be fixed first.)
Indeed, in your experiments, at the moment you 10x the learning rate, accuracy does in fact plummet! I’m a bit surprised it manages to recover, but you can see that the recovery is not nearly as stable as the original training before increasing the learning rate (this is even more obvious in the 50x case), and notably even the recovery for the training accuracy looks like it takes longer (1000-2000 steps) than the original increase in training accuracy (~400 steps).
I do think this suggests that you can’t in fact “just 10x the learning rate” once grokking starts, which seems like a hit to my story.
I updated the report with the training curves. Under default settings, 100% training accuracy is reached after 500 steps.
There is actually an overlap between the train/val curves going up. Might be an artifact of the simplicity of the task or that I didn’t properly split the dataset (e.g. x+y being in train and y+x being in val). I might run it again for a harder task to verify.
Huh, intriguing. Yeah, it might be worth running with a non-commutative function and seeing if it holds up—it seems like in the default setting the validation accuracy hits almost 0.5 once the training accuracy is 1, which is about what you’d get if you understood commutativity but nothing else about the function. So the “grokking” part is probably happening after that, i.e. at roughly the 1.5k steps location in the default setting.
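The commutativity point can be checked with a quick sketch. Assuming a prime modulus of 97 and a 50% train fraction (both assumptions, since the report’s actual settings are not stated here), a random split over ordered pairs leaves roughly half of the validation pairs with their mirror image (y, x) in the training set. Splitting over unordered pairs instead removes that leakage. All function names below are hypothetical.

```python
import random


def mirrored_fraction(p=97, train_frac=0.5, seed=0):
    """Fraction of validation pairs (x, y) whose mirror (y, x) landed in the training set."""
    pairs = [(x, y) for x in range(p) for y in range(p)]
    random.Random(seed).shuffle(pairs)
    n_train = int(train_frac * len(pairs))
    train = set(pairs[:n_train])
    val = pairs[n_train:]
    return sum((y, x) in train for x, y in val) / len(val)


def commutativity_safe_split(p=97, train_frac=0.5, seed=0):
    """Split on unordered pairs so (x, y) and (y, x) always end up on the same side."""
    unordered = [(x, y) for x in range(p) for y in range(x, p)]
    random.Random(seed).shuffle(unordered)
    n_train = int(train_frac * len(unordered))

    def expand(group):
        # Put both orderings of each unordered pair back in.
        return {q for x, y in group for q in ((x, y), (y, x))}

    return expand(unordered[:n_train]), expand(unordered[n_train:])


print("mirrored fraction under a naive split:", mirrored_fraction())
train, val = commutativity_safe_split()
print("overlap under the safe split:", len(train & val))
```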
So I ran some experiments for the permutation group S_5 with the task x ∘ y = ?
Interestingly, here increasing the learning rate just never works. I’m very confused.
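For readers unfamiliar with the task, here is a minimal sketch of how an S_5 composition dataset could be built. The composition convention (apply the right-hand permutation first) and the integer indexing of group elements are assumptions for illustration, not necessarily what the re-implementation does.

```python
from itertools import permutations

# All 120 elements of S_5, each stored as a tuple mapping position i to perm[i].
elements = list(permutations(range(5)))
index = {perm: k for k, perm in enumerate(elements)}


def compose(a, b):
    """(a ∘ b)(i) = a[b[i]], i.e. apply b first, then a (assumed convention)."""
    return tuple(a[j] for j in b)


# One example per ordered pair: (index of x, index of y, index of x ∘ y).
examples = [(index[a], index[b], index[compose(a, b)]) for a in elements for b in elements]

# Unlike modular addition, most pairs here do not commute.
non_commuting = sum(compose(a, b) != compose(b, a) for a in elements for b in elements)
print(len(elements), len(examples), non_commuting)
```

Because the operation is non-commutative, a model cannot get validation pairs for free from their mirrored training pairs, which is what makes this a useful check on the modular addition results.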
Also interestingly, in the default setting for these new experiments, grokking happens in ~1000 steps while memorization happens in ~1500 steps, so the grokking is already faster than the memorization, in stark contrast to the graphs in the original post.
(This does depend on when you start the counter for grokking, as there’s a long period of slowly increasing validation accuracy. You could reasonably say grokking took ~2500 steps.)
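The dependence on where the counter starts is easy to see with a small helper that reads phase lengths off logged accuracy curves. The thresholds and the curve values below are made up for illustration (chosen to resemble the shape described, not taken from the report), and the function names are hypothetical.

```python
def first_step_reaching(steps, values, threshold):
    """First logged step at which the value is at or above the threshold, else None."""
    for step, value in zip(steps, values):
        if value >= threshold:
            return step
    return None


def phase_lengths(steps, train_acc, val_acc, done=0.99, grok_start=0.5):
    """Memorization = steps until train acc >= done; grokking = steps from val acc >= grok_start to val acc >= done."""
    memorized = first_step_reaching(steps, train_acc, done)
    started = first_step_reaching(steps, val_acc, grok_start)
    finished = first_step_reaching(steps, val_acc, done)
    if None in (memorized, started, finished):
        return None
    return {"memorization": memorized, "grokking": finished - started}


# Made-up curves with a long period of slowly rising validation accuracy.
steps = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000]
train_acc = [0.0, 0.4, 0.8, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
val_acc = [0.0, 0.05, 0.1, 0.2, 0.3, 0.45, 0.6, 0.9, 1.0]

print(phase_lengths(steps, train_acc, val_acc, grok_start=0.5))
print(phase_lengths(steps, train_acc, val_acc, grok_start=0.2))
```

Moving the `grok_start` threshold is enough to change the measured grokking length substantially on the same curve, so any “grokking is faster than memorization” claim should state the threshold used.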
Oh, I thought figure 1 was S5, but it actually is modular division. I’ll give that a go.
Here are results for modular division. Not super sure what to make of them. Small increases in learning rate work, but so does just choosing a larger learning rate from the beginning. In fact, increasing the LR to 5x from the beginning works super well, but switching to 5x once grokking arguably starts just destroys any progress. 10x LR from the start does not work (nor does switching to it later).
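As a reference for the task itself, modular division x / y mod p can be generated by multiplying with the modular inverse of y. The modulus 97 is an assumption here, as is the function name; `pow(y, -1, p)` requires Python 3.8 or newer.

```python
P = 97  # assumed prime modulus, not confirmed for these experiments


def mod_div(x, y, p=P):
    """x / y mod p, computed as x times the modular inverse of y (y must be nonzero mod p)."""
    return (x * pow(y, -1, p)) % p


# One example per pair with a nonzero divisor: (x, y, x / y mod p).
examples = [(x, y, mod_div(x, y)) for x in range(P) for y in range(1, P)]

# Sanity check: multiplying the answer back by y recovers x.
assert all((z * y) % P == x for x, y, z in examples)
print(len(examples), examples[:3])
```

Like S_5 composition, this operation is not commutative, so mirrored pairs in the training set do not give away validation answers.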
So maybe the initial observation is more a general/global property of the loss landscape for the task and not of the particular region during grokking?
Yeah, that seems right, I think I’m basically at “no, you can’t just 10x the learning rate once grokking starts”.
Increasing regularization (weight decay in this instance) might rescue the ones which don’t work.
I tried increasing weight decay and increasing batch sizes, but so far no real success compared to 5x LR. Not going to investigate this further atm.
Yep, I used my own re-implementation, which somehow has slightly different behavior.
I’ll also note that the task in the report is modular addition while figure 1 from the paper (the one with the red and green lines for train/val) is the significantly harder permutation group task.