The plateau corresponds to a period with a lot of bump formation. If bumps really are a sign of vectors competing to represent different chunks of subspace, then maybe this says that Adam produces more such competition (maybe by making different vectors learn at more similar rates?).
I caution against over-interpreting the results of single runs—I think there’s a good chance the number of bumps varies significantly by random seed.
It’s a good caution, but I do see more bumps with Adam than with SGD across a number of random initializations.
(with the caveat that this is still “I tried a few times” and not any quantitative study)