How does this interact with Adam? In particular, Adam gets super messy because you can’t just disentangle things. Even worse, how does it interact with AdamW?
Should be trivial to modify my code to use AdamW, just replace SGD with Adam on line 33.
EDIT: ran the experiments for rank 1, they seem a bit different than Adam Jermyn’s results—it looks like AdamW just accelerates things?
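For anyone wondering what actually changes in that swap, here is a rough pure-Python sketch of the three update rules on a single scalar weight. It is not the code from the post: the function names (`sgd_step`, `adam_step`, `new_state`), the hyperparameters, and the presence of a weight-decay term are all assumptions made for illustration. With SGD the decay term simply adds to the gradient; with Adam plus L2 the decay is folded into the gradient and rescaled by the adaptive normalization; with AdamW the decay is applied to the weight directly.

```python
import math

def sgd_step(w, grad, lr, wd):
    # Plain SGD with L2 weight decay: decay and loss gradient add linearly.
    return w - lr * (grad + wd * w)

def new_state():
    # Hypothetical per-parameter Adam state: step count, first and second moments.
    return {"t": 0, "m": 0.0, "v": 0.0}

def adam_step(w, grad, state, lr, wd, decoupled, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam + L2 step (decoupled=False) or AdamW step (decoupled=True)."""
    if not decoupled:
        # Adam + L2: the decay term passes through the adaptive rescaling below.
        grad = grad + wd * w
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    if decoupled:
        # AdamW: shrink the weight directly, outside the adaptive rescaling.
        w = w - lr * wd * w
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

# Assumed toy setting: zero loss gradient, so only weight decay moves the weight.
w0, lr, wd = 1.0, 0.1, 0.01
w_sgd = sgd_step(w0, 0.0, lr, wd)
w_adam = adam_step(w0, 0.0, new_state(), lr, wd, decoupled=False)
w_adamw = adam_step(w0, 0.0, new_state(), lr, wd, decoupled=True)
print(w_sgd, w_adam, w_adamw)
```

In this toy setting SGD and AdamW both shrink the weight by a factor of about `1 - lr * wd`, while the first Adam + L2 step moves it by roughly `lr` regardless of how small `wd` is, which is one concrete sense in which the terms stop being separable.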
Woah, nice! Note that I didn’t check rank 1 with Adam, just rank >= 2.