Based on my understanding of grokking, it is just something that will always occur in a suitable regime: sufficient model capacity (an overcomplete network) combined with proper complexity regularization. If you optimize on the training set hard enough, the model will first memorize it and reach near-zero prediction error but non-zero regularization error. At that point, if you continue optimizing, there is only one way the system can further improve the loss function: maintain low/zero prediction error while also reducing the regularization penalty, and thus model complexity. It seems inevitable given the right setup and sufficient optimization time (although typical training regimes may not suffice in practice).
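To make that concrete, here is a minimal sketch of the loss decomposition I mean. Everything below (architecture, data, hyperparameters) is a placeholder I made up for illustration, not a setup from any particular grokking paper: a toy overparameterized classifier trained with an explicit L2 penalty, so the loss literally splits into a prediction term plus a complexity term. Once the prediction term is near zero, the penalty term is the only thing left to push on.

```python
# Sketch: total_loss = prediction_loss + lam * ||theta||^2.
# After the (small, fully memorizable) training set is fit, continued
# optimization can only lower total_loss by shrinking the weights,
# i.e. by reducing model complexity.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 512), nn.ReLU(), nn.Linear(512, 10))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
lam = 1e-3  # regularization strength (illustrative)

# Placeholder data: small enough to be memorized outright.
x_train = torch.randn(256, 64)
y_train = torch.randint(0, 10, (256,))

for step in range(100_000):  # deliberately run far past memorization
    opt.zero_grad()
    pred_loss = loss_fn(model(x_train), y_train)
    reg_loss = lam * sum(p.square().sum() for p in model.parameters())
    (pred_loss + reg_loss).backward()
    opt.step()
    if step % 10_000 == 0:
        print(f"step {step}: prediction {pred_loss.item():.4f}, penalty {reg_loss.item():.4f}")
```

The point of logging both terms separately is that you can watch the prediction term bottom out first, after which all remaining progress shows up in the penalty term.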
Sure, that’s quite plausible. Though I should have been clearer and said I wanted some examples of grokking in deep RL specifically. Mostly because I was thinking of running some experiments trying to prevent grokking à la Omnigrok, and wanted to see what the best examples of grokking were.
Curious: why would you want to prevent grokking? Normally one would want to encourage it.
To see if Omnigrok’s mechanism for enabling/stopping grokking works beyond the three areas they investigated. If it does, then we can be more confident we know how to stop it from occurring and instead force the model to reach the same performance incrementally. That might make it easier to predict future performance, and it would also give us more information about the phenomenon. Plus, like, I’m implementing some deep RL algorithms anyway, so might as well, right?
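For concreteness, here is the kind of intervention I have in mind, based on my reading of Omnigrok’s claim that grokking is governed by the weight norm at initialization. The rescaling factor and the little Q-network below are illustrative placeholders, and whether this transfers to the networks inside a deep RL agent is exactly the untested part.

```python
# Sketch of an Omnigrok-style intervention: rescale the weight norm at
# initialization. Per the Omnigrok picture, a large initial norm encourages
# grokking-style delayed generalization, a small one suppresses it.
import torch
import torch.nn as nn

def rescale_init_weight_norm(model: nn.Module, alpha: float) -> None:
    """Multiply every parameter by alpha at initialization.

    alpha > 1 inflates the initial weight norm (expected to encourage grokking);
    alpha < 1 shrinks it (expected to suppress the delay).
    """
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(alpha)

# Example: a small Q-network for a DQN-style agent (architecture is illustrative).
q_net = nn.Sequential(nn.Linear(8, 256), nn.ReLU(), nn.Linear(256, 4))
rescale_init_weight_norm(q_net, alpha=0.3)  # shrink init norm to try to prevent grokking
```

The experiment would then just be the usual RL training loop with and without the rescaling, watching whether the delayed jump in return disappears or turns into a gradual climb.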