Are large models like MuZero or GPT-3 trained with these kinds of dropout/modularity/generalizability techniques? Or should we expect that we might be able to make even more capable models by incorporating them?
Good question! I’ll go look at those two papers.
The GPT-3 paper doesn’t mention dropout, but it does mention using Decoupled Weight Decay Regularization, which is apparently equivalent to L2 regularization under SGD (but not Adam!). I imagine something called ‘Weight Decay’ imposes a connection cost.
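To make the SGD-vs-Adam point concrete, here's a tiny numerical sketch (my own toy example, not anything from the GPT-3 paper) of a single update step. Under plain SGD, folding an L2 penalty into the loss and applying decoupled weight decay land on exactly the same weights; under Adam they come apart, because the L2 gradient would get rescaled by Adam's per-parameter statistics while decoupled decay wouldn't.

```python
import numpy as np

# Toy illustration: one update step on a single weight vector, comparing an
# explicit L2 penalty in the loss against decoupled weight decay applied
# directly to the weights. (Hypothetical numbers, chosen only for the demo.)

rng = np.random.default_rng(0)
w = rng.normal(size=5)          # current weights
grad = rng.normal(size=5)       # gradient of the unregularized loss at w
lr, lam = 0.1, 0.01             # learning rate, regularization strength

# (a) SGD with an L2 penalty (lam/2)*||w||^2 folded into the loss:
w_l2 = w - lr * (grad + lam * w)

# (b) SGD with decoupled weight decay: shrink the weights, then step on the loss gradient:
w_wd = (1 - lr * lam) * w - lr * grad

print(np.allclose(w_l2, w_wd))  # True -- identical under plain SGD

# Under Adam the two differ: the L2 term in (a) gets divided by Adam's
# per-parameter sqrt(v) + eps denominator along with the rest of the gradient,
# while the decoupled decay in (b) keeps shrinking every weight at the same rate.
```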
The MuZero paper reports using L2 regularization, but not dropout.
My intuition says that dropout is more useful when working with supervised learning on a not-massive dataset for a not-massive model, although I’m not yet sure why this is. I suspect this conceptual hole is somehow related to Deep Double Descent, which I don’t yet understand on an intuitive level (Edit: looks like nobody does). I also suspect that GPT-3 is pretty modular even without using any of those tricks I listed.
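For reference, this is the mechanism I mean by dropout, in a bare-bones numpy sketch (my own illustration, not how GPT-3 or MuZero implement anything): at train time each unit is zeroed with probability p and the survivors are rescaled so the expected activation is unchanged, which is what stops the network from leaning on any single connection.

```python
import numpy as np

# Minimal inverted-dropout sketch: zero each activation with probability p at
# train time and rescale the survivors by 1/(1-p); do nothing at test time.

def dropout(activations, p=0.5, train=True, rng=None):
    if not train or p == 0.0:
        return activations
    if rng is None:
        rng = np.random.default_rng()
    keep_mask = rng.random(activations.shape) >= p   # which units survive
    return activations * keep_mask / (1.0 - p)       # rescale survivors

h = np.ones(8)                 # some hidden-layer activations
print(dropout(h, p=0.5))       # roughly half the entries zeroed, the rest doubled
print(dropout(h, train=False)) # unchanged at test time
```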