Good question! I’ll go look at those two papers.
The GPT-3 paper doesn’t mention dropout, but it does mention using Decoupled Weight Decay Regularization, which is apparently equivalent to L2 regularization under vanilla SGD (but not under Adam!). I imagine something called ‘weight decay’ imposes a connection cost, since it shrinks every weight toward zero a little on each update.
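For anyone who wants to see that SGD-vs-Adam distinction concretely, here’s a minimal sketch (plain NumPy, a made-up toy quadratic loss; the function names and hyperparameters are just for illustration, not anything from the papers). Under SGD, folding the L2 penalty’s gradient into the update gives exactly the same step as decaying the weights directly, but under Adam the penalty term gets rescaled by the adaptive denominator, so the two stop being equivalent.

```python
# Sketch: L2 regularization vs. decoupled weight decay, under SGD and under Adam.
# Toy quadratic loss 0.5 * ||w - target||^2 is purely illustrative.
import numpy as np

def grad(w):
    target = np.array([3.0, -2.0])
    return w - target

def sgd_l2(w, lr=0.1, lam=0.01, steps=100):
    # L2 regularization: the penalty's gradient (lam * w) is folded into the update.
    for _ in range(steps):
        w = w - lr * (grad(w) + lam * w)
    return w

def sgd_decoupled(w, lr=0.1, lam=0.01, steps=100):
    # Decoupled weight decay: shrink the weights directly, outside the gradient.
    for _ in range(steps):
        w = w - lr * grad(w) - lr * lam * w
    return w

def adam(w, lr=0.1, lam=0.01, steps=100, decoupled=False,
         b1=0.9, b2=0.999, eps=1e-8):
    # Adam with either L2-in-the-gradient (decoupled=False) or
    # AdamW-style decoupled decay (decoupled=True).
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w) if decoupled else grad(w) + lam * w
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        m_hat = m / (1 - b1**t)
        v_hat = v / (1 - b2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        if decoupled:
            w = w - lr * lam * w
    return w

w0 = np.array([10.0, 10.0])
print(sgd_l2(w0.copy()), sgd_decoupled(w0.copy()))       # identical results
print(adam(w0.copy()), adam(w0.copy(), decoupled=True))  # noticeably different
```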
The MuZero paper reports using L2 regularization, but not dropout.
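(For contrast with the decoupled form above, MuZero writes its L2 regularization as an explicit penalty term added to the loss, c‖θ‖². A tiny sketch of that framing, with made-up parameter shapes and coefficient:)

```python
# Sketch: L2 regularization expressed as a loss term, total_loss = task_loss + c * ||theta||^2.
import numpy as np

def l2_penalty(params, c=1e-4):
    # Sum of squared weights over all parameter arrays, scaled by coefficient c.
    return c * sum(np.sum(p**2) for p in params)

params = [np.random.randn(4, 4), np.random.randn(4)]  # placeholder parameters
task_loss = 0.37                                      # placeholder task loss
total_loss = task_loss + l2_penalty(params)
print(total_loss)
```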
My intuition says that dropout is more useful when working with supervised learning on a not-massive dataset for a not-massive model, although I’m not yet sure why this is. I suspect this conceptual hole is somehow related to Deep Double Descent, which I don’t yet understand on an intuitive level (Edit: looks like nobody does). I also suspect that GPT-3 is pretty modular even without using any of those tricks I listed.