I think it’s worth noting that no large-scale system uses ‘true’ SGD; it’s all AdamW, and the weight decay seems like a strong part of the inductive bias. Of course “everything that works is approximately Bayesian”, but the mathematics that people talk about with respect to SGD just isn’t relevant to practice.
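To make the contrast concrete, here is a rough single-parameter sketch of the two update rules. It follows the commonly cited AdamW form (Adam moment estimates plus decoupled weight decay); the function names, the `state` dict, and the hyperparameter defaults are my own illustrative choices, not taken from any particular training stack.

```python
import math


def sgd_step(theta, grad, lr=0.1):
    """Plain SGD: the update is just the scaled gradient."""
    return theta - lr * grad


def new_state():
    """Fresh optimizer state for one scalar parameter (hypothetical helper)."""
    return {"t": 0, "m": 0.0, "v": 0.0}


def adamw_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW step: Adam-normalized gradient plus decoupled weight decay."""
    state["t"] += 1
    # Exponential moving averages of the gradient and the squared gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # Decoupled decay: shrink the weight directly, independent of the gradient.
    theta = theta * (1 - lr * weight_decay)
    # Per-parameter normalized step.
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


# With a zero gradient, SGD leaves the weight alone; AdamW still shrinks it.
print(sgd_step(2.0, 0.0))
print(adamw_step(2.0, 0.0, new_state()))

# On the first step, AdamW's update size barely depends on gradient scale.
print(adamw_step(0.0, 1.0, new_state(), weight_decay=0.0))
print(adamw_step(0.0, 1000.0, new_state(), weight_decay=0.0))
```

Two things show up even in this toy version: the weights are pulled toward zero whether or not the loss asks for it, and the step is normalized per parameter, so the dynamics differ from those of the plain SGD update.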
(opinions my own)