uh, yeah, no shit Adam hits a floor on the loss in this context. The whole point of Adam is to keep a running estimate of the second moment (the uncentered variance) of the gradients and divide each step by its square root, so step sizes stay roughly constant at the base learning rate. In the full-batch setting that means once Adam gets close to a local minimum, it just oscillates around it: the gradient shrinks, but so does the second-moment estimate, so the effective step never shrinks below the learning rate and it keeps overshooting instead of settling in. None of this matters for networks of practical size, because they never actually get close to anything resembling a local minimum.
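Here's a quick sketch of what I mean (my own toy example, not from the original post): from-scratch Adam on a 1-D full-batch quadratic, next to plain gradient descent with the same learning rate. The loss, lr, betas, and step count are arbitrary illustrative choices.

```python
import math

# Toy full-batch problem: loss(x) = 0.5 * x**2, exact gradient g = x (no noise).
lr, beta1, beta2, eps = 1e-2, 0.9, 0.999, 1e-8
x, m, v = 5.0, 0.0, 0.0
x_gd = 5.0  # plain gradient descent with the same lr, for contrast

for t in range(1, 5001):
    g = x                                     # full-batch gradient, exact
    m = beta1 * m + (1 - beta1) * g           # running first moment (mean of g)
    v = beta2 * v + (1 - beta2) * g * g       # running second moment (mean of g^2)
    m_hat = m / (1 - beta1 ** t)              # bias corrections
    v_hat = v / (1 - beta2 ** t)
    # Normalized step: |m_hat / sqrt(v_hat)| stays ~1 once the EMAs settle, so the
    # step magnitude stays ~lr even as the gradient itself goes to zero.
    x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    x_gd -= lr * x_gd                         # GD step shrinks with the gradient

print("Adam loss:", 0.5 * x * x)        # floors around lr^2 scale, bouncing near 0
print("GD   loss:", 0.5 * x_gd * x_gd)  # essentially 0 (geometric convergence)
```

Run it and Adam plateaus at a loss on the order of lr², bouncing back and forth across the minimum, while plain GD converges geometrically to machine-precision zero.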