uh, yeah, no shit Adam hits a floor on the loss in this context. The whole point of Adam is to keep a running estimate of the second moment (the uncentered variance) of the gradients and divide each step by its square root, so step sizes stay roughly constant at the base learning rate. In the full-batch setting that means once Adam gets close to a local minimum, it just oscillates around it: the gradient shrinks, but so does the second-moment estimate, so the effective step never shrinks below the learning rate and it keeps overshooting instead of settling in. None of this matters for networks of practical size, because they never actually get close to anything resembling a local minimum.
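Here's a quick sketch of what I mean (my own toy example, not from the original post): from-scratch Adam on a 1-D full-batch quadratic, next to plain gradient descent with the same learning rate. The loss, lr, betas, and step count are arbitrary illustrative choices.

```python
import math

# Toy full-batch problem: loss(x) = 0.5 * x**2, exact gradient g = x (no noise).
lr, beta1, beta2, eps = 1e-2, 0.9, 0.999, 1e-8
x, m, v = 5.0, 0.0, 0.0
x_gd = 5.0  # plain gradient descent with the same lr, for contrast

for t in range(1, 5001):
    g = x                                     # full-batch gradient, exact
    m = beta1 * m + (1 - beta1) * g           # running first moment (mean of g)
    v = beta2 * v + (1 - beta2) * g * g       # running second moment (mean of g^2)
    m_hat = m / (1 - beta1 ** t)              # bias corrections
    v_hat = v / (1 - beta2 ** t)
    # Normalized step: |m_hat / sqrt(v_hat)| stays ~1 once the EMAs settle, so the
    # step magnitude stays ~lr even as the gradient itself goes to zero.
    x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    x_gd -= lr * x_gd                         # GD step shrinks with the gradient

print("Adam loss:", 0.5 * x * x)        # floors around lr^2 scale, bouncing near 0
print("GD   loss:", 0.5 * x_gd * x_gd)  # essentially 0 (geometric convergence)
```

Run it and Adam plateaus at a loss on the order of lr², bouncing back and forth across the minimum, while plain GD converges geometrically to machine-precision zero.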