(Adam Jermyn ninja’ed my rank 2 results as I forgot to refresh, lol)
Weight decay just means the gradient becomes $-\nabla_x L = 2(\langle b, y \rangle a - \langle y, y \rangle x) - \lambda x$, which effectively “extends” the exponential phase. It’s pretty easy to confirm that this is the case:
You can see the other figures from the main post here:
https://imgchest.com/p/9p4nl6vb7nq
(Lighter colors show the loss curves for each of the 10 random seeds.)
Here’s my code for the weight decay experiments if anyone wants to play with them or check that I didn’t mess something up: https://gist.github.com/Chanlaw/e8c286629e0626f723a20cef027665d1
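For reference, the gradient above is what you get from the rank-1 factorization loss $L = \lVert xy^\top - ab^\top \rVert_F^2$ with an $\ell_2$ penalty on $x$ (the loss form is my inference from the gradient; the dimensions, learning rate, and decay coefficient below are placeholders rather than the gist’s actual values, and training $y$ alongside $x$ is my assumption):

```python
import torch

# Minimal sketch, not the gist itself: learn a rank-1 target a b^T as an
# outer product x y^T. SGD's weight_decay adds lambda * param to the
# gradient, which supplies the -lambda * x term in the update above.
torch.manual_seed(0)
d = 32
a, b = torch.randn(d), torch.randn(d)  # fixed rank-1 target a b^T

x = torch.nn.Parameter(1e-3 * torch.randn(d))  # small init -> exponential phase
y = torch.nn.Parameter(1e-3 * torch.randn(d))
opt = torch.optim.SGD([x, y], lr=1e-2, weight_decay=1e-2)

losses = []
for step in range(20_000):
    opt.zero_grad()
    loss = (torch.outer(x, y) - torch.outer(a, b)).pow(2).sum()
    loss.backward()
    opt.step()
    losses.append(loss.item())
print(losses[-1])
```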
How does this interact with Adam? In particular, Adam gets super messy because you can’t just disentangle things. Even worse, how does it interact with AdamW?

Should be trivial to modify my code to use AdamW: just replace SGD with AdamW on line 33 (see the sketch below).

EDIT: ran the experiments for rank 1; they seem a bit different from Adam Jermyn’s results. It looks like AdamW just accelerates things?
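For concreteness, the swap is just the optimizer constructor (hyperparameters here are placeholders, not the gist’s actual values):

```python
# Before: plain SGD with L2-style weight decay folded into the gradient.
opt = torch.optim.SGD([x, y], lr=1e-2, weight_decay=1e-2)

# After: AdamW applies decoupled weight decay (Loshchilov & Hutter),
# so the weight_decay argument carries over directly.
opt = torch.optim.AdamW([x, y], lr=1e-3, weight_decay=1e-2)
```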
Woah, nice! Note that I didn’t check rank 1 with Adam, just rank >= 2.