Are large models like MuZero or GPT-3 trained with these kinds of dropout/modularity/generalizability techniques? Or should we expect that we might be able to make even more capable models by incorporating them?
Good question! I’ll go look at those two papers.
The GPT-3 paper doesn’t mention dropout, but it does mention using Decoupled Weight Decay Regularization, which is apparently equivalent to L2 regularization under SGD (but not Adam!). I imagine something called ‘Weight Decay’ imposes a connection cost.
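To make the SGD-vs-Adam point concrete, here's a tiny numerical sketch (my own toy example, not anything from the GPT-3 paper) of a single update step. Under plain SGD, folding an L2 penalty into the loss and applying decoupled weight decay land on exactly the same weights; under Adam they come apart, because the L2 gradient would get rescaled by Adam's per-parameter statistics while decoupled decay wouldn't.

```python
import numpy as np

# Toy illustration: one update step on a single weight vector, comparing an
# explicit L2 penalty in the loss against decoupled weight decay applied
# directly to the weights. (Hypothetical numbers, chosen only for the demo.)

rng = np.random.default_rng(0)
w = rng.normal(size=5)          # current weights
grad = rng.normal(size=5)       # gradient of the unregularized loss at w
lr, lam = 0.1, 0.01             # learning rate, regularization strength

# (a) SGD with an L2 penalty (lam/2)*||w||^2 folded into the loss:
w_l2 = w - lr * (grad + lam * w)

# (b) SGD with decoupled weight decay: shrink the weights, then step on the loss gradient:
w_wd = (1 - lr * lam) * w - lr * grad

print(np.allclose(w_l2, w_wd))  # True -- identical under plain SGD

# Under Adam the two differ: the L2 term in (a) gets divided by Adam's
# per-parameter sqrt(v) + eps denominator along with the rest of the gradient,
# while the decoupled decay in (b) keeps shrinking every weight at the same rate.
```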
The MuZero paper reports using L2 regularization, but not dropout.
My intuition says that dropout is more useful when working with supervised learning on a not-massive dataset for a not-massive model, although I’m not yet sure why this is. I suspect this conceptual hole is somehow related to Deep Double Descent, which I don’t yet understand on an intuitive level (Edit: looks like nobody does). I also suspect that GPT-3 is pretty modular even without using any of those tricks I listed.
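For reference, this is the mechanism I mean by dropout, in a bare-bones numpy sketch (my own illustration, not how GPT-3 or MuZero implement anything): at train time each unit is zeroed with probability p and the survivors are rescaled so the expected activation is unchanged, which is what stops the network from leaning on any single connection.

```python
import numpy as np

# Minimal inverted-dropout sketch: zero each activation with probability p at
# train time and rescale the survivors by 1/(1-p); do nothing at test time.

def dropout(activations, p=0.5, train=True, rng=None):
    if not train or p == 0.0:
        return activations
    if rng is None:
        rng = np.random.default_rng()
    keep_mask = rng.random(activations.shape) >= p   # which units survive
    return activations * keep_mask / (1.0 - p)       # rescale survivors

h = np.ones(8)                 # some hidden-layer activations
print(dropout(h, p=0.5))       # roughly half the entries zeroed, the rest doubled
print(dropout(h, train=False)) # unchanged at test time
```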