(Adam Jermyn ninja’ed my rank 2 results as I forgot to refresh, lol)
Weight decay just means the gradient becomes $-\nabla_x L = 2(\langle b, y \rangle a - \langle y, y \rangle x) - \lambda x$, which effectively “extends” the exponential phase. It’s pretty easy to confirm that this is the case:
You can see the other figures from the main post here:
https://imgchest.com/p/9p4nl6vb7nq
(Lighter colors show the loss curves for each of the 10 random seeds.)
Here’s my code for the weight decay experiments if anyone wants to play with them or check that I didn’t mess something up: https://gist.github.com/Chanlaw/e8c286629e0626f723a20cef027665d1
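For reference, the gradient above is what you get from the rank-1 factorization loss $L = \lVert xy^\top - ab^\top \rVert_F^2$ with an $\ell_2$ penalty on $x$ (the loss form is my inference from the gradient; the dimensions, learning rate, and decay coefficient below are placeholders rather than the gist’s actual values, and training $y$ alongside $x$ is my assumption):

```python
import torch

# Minimal sketch, not the gist itself: learn a rank-1 target a b^T as an
# outer product x y^T. SGD's weight_decay adds lambda * param to the
# gradient, which supplies the -lambda * x term in the update above.
torch.manual_seed(0)
d = 32
a, b = torch.randn(d), torch.randn(d)  # fixed rank-1 target a b^T

x = torch.nn.Parameter(1e-3 * torch.randn(d))  # small init -> exponential phase
y = torch.nn.Parameter(1e-3 * torch.randn(d))
opt = torch.optim.SGD([x, y], lr=1e-2, weight_decay=1e-2)

losses = []
for step in range(20_000):
    opt.zero_grad()
    loss = (torch.outer(x, y) - torch.outer(a, b)).pow(2).sum()
    loss.backward()
    opt.step()
    losses.append(loss.item())
print(losses[-1])
```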
How does this interact with Adam? In particular, Adam gets super messy because you can’t just disentangle things. Even worse, how does it interact with AdamW?

Should be trivial to modify my code to use AdamW: just replace SGD with AdamW on line 33 (see the sketch below).

EDIT: ran the experiments for rank 1; they seem a bit different from Adam Jermyn’s results. It looks like AdamW just accelerates things?
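For concreteness, the swap is just the optimizer constructor (hyperparameters here are placeholders, not the gist’s actual values):

```python
# Before: plain SGD with L2-style weight decay folded into the gradient.
opt = torch.optim.SGD([x, y], lr=1e-2, weight_decay=1e-2)

# After: AdamW applies decoupled weight decay (Loshchilov & Hutter),
# so the weight_decay argument carries over directly.
opt = torch.optim.AdamW([x, y], lr=1e-3, weight_decay=1e-2)
```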
Woah, nice! Note that I didn’t check rank 1 with Adam, just rank >= 2.