Regardless of what’s going on inside the network, it wouldn’t be surprising if raising the learning rate increased convergence.
I’m kinda confused at your perspective on learning rates. I usually think of learning rates as being set to the maximum possible value such that training is still stable. So it would in fact be surprising if you could just 10x them to speed up convergence. (So an additional aspect of my prediction would be that you can’t 10x the learning rate at the beginning of training; if you could then it seems like the hyperparameters were chosen poorly and that should be fixed first.)
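To make the "maximum stable learning rate" intuition concrete, here is a toy sketch. It is plain gradient descent on a made-up two-dimensional quadratic, not the grokking setup under discussion. The function name `final_loss`, the curvatures, and the specific learning rates are all illustrative assumptions. On a quadratic with curvature c, each step multiplies the coordinate by (1 - lr * c), so training is stable only while lr < 2 / (largest curvature).

```python
def final_loss(lr, curvatures=(1.0, 0.1), steps=50, x0=1.0):
    """Gradient descent on the toy loss f(x) = sum(0.5 * c_i * x_i**2).

    Hypothetical helper for illustration only; returns the loss after `steps` updates.
    """
    xs = [x0 for _ in curvatures]
    for _ in range(steps):
        # The gradient of 0.5 * c * x**2 is c * x, so each update
        # scales the coordinate by (1 - lr * c).
        xs = [x - lr * c * x for x, c in zip(xs, curvatures)]
    return sum(0.5 * c * x * x for x, c in zip(xs, curvatures))


curvatures = (1.0, 0.1)
initial_loss = sum(0.5 * c for c in curvatures)  # loss at x0 = 1.0 in every coordinate
max_stable_lr = 2.0 / max(curvatures)            # need |1 - lr * c| < 1 for every c

tuned = final_loss(1.8)       # just under the stability limit
tuned_10x = final_loss(18.0)  # 10x a well-tuned rate: far past the limit
timid = final_loss(0.18)      # a rate chosen 10x too small

print(f"max stable lr: {max_stable_lr}")
print(f"tuned lr=1.8      -> loss {tuned:.3e}")
print(f"10x tuned lr=18.0 -> loss {tuned_10x:.3e}")
print(f"timid lr=0.18     -> loss {timid:.3e}")
```

In this toy setting, 10x-ing the rate only helps when the starting rate was far below the stability limit (0.18 to 1.8). 10x-ing a rate that was already near the limit (1.8 to 18.0) makes the loss blow up instead of converging faster. Real networks are not quadratics, so this is only meant to illustrate the intuition, not to model the experiments.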
Indeed, in your experiments, at the moment you 10x the learning rate, accuracy does in fact plummet! I’m a bit surprised it manages to recover, but you can see that the recovery is not nearly as stable as the original training before the learning rate increase (this is even more obvious in the 50x case). Notably, even the recovery of the training accuracy looks like it takes longer (1000-2000 steps) than the original increase in training accuracy (~400 steps).
I do think this suggests that you can’t in fact “just 10x the learning rate” once grokking starts, which seems like a hit to my story.