Lech Mazur comments on DL towards the unaligned Recursive Self-Optimization attractor

Lech Mazur 18 Dec 2021 16:24 UTC
10 points
I’d like to, but it’ll have to wait until I’m finished with a commercial project where I’m using them, or until I replace these techniques with something else in my code. I’ll post a reply here once I do. I’d expect somebody else to discover at least one of them in the meantime; they’re not some stunning insights.
- Lech Mazur 10 Feb 2022 1:57 UTC
  1 point
  One of these improvements was just published: https://arxiv.org/abs/2202.03599 . Since they were able to publish already, they likely had this idea before me. What I noticed is that in the Sharpness-Aware Minimization paper (ICLR 2021, https://arxiv.org/abs/2010.01412), the first gradient is just ignored when updating the weights, as can be seen in Figure 2 or in the pseudo-code. But that’s a valuable data point that the optimizer would normally use to update the weights, so why not do the update step using a value in between the two gradients? And it works.
  The nice thing is that it’s possible to implement this with no extra memory and almost no extra compute compared to SAM: you don’t need to store the first gradient separately. Just multiply it by some factor, don’t zero out the gradients, let the second gradient accumulate on top, and rescale the sum.
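
The accumulation trick described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the commenter's code and not necessarily the exact formulation in the linked paper: the names `blended_sam_step`, `alpha`, `rho`, and `lr` are hypothetical, the loss is a toy function with an analytic gradient, and the target update is assumed to be `alpha * g1 + (1 - alpha) * g2`. Plain Python lists stand in for parameter tensors; in an autograd framework, the second backward pass would add into the same gradient buffer directly.

```python
import math


def grad_loss(w):
    # Gradient of a toy loss L(w) = sum(w_i^4) / 4, chosen only for illustration.
    return [x ** 3 for x in w]


def blended_sam_step(w, lr=0.1, rho=0.05, alpha=0.5):
    """One update using alpha*g1 + (1-alpha)*g2 with a single gradient buffer.

    alpha = 0 reduces to the plain SAM update, which uses only g2.
    """
    assert 0.0 <= alpha < 1.0

    # First gradient g1 at the current weights, held in the only buffer.
    buf = grad_loss(w)

    # SAM's perturbation, computed from g1 before the buffer is rescaled.
    norm = math.sqrt(sum(g * g for g in buf)) + 1e-12
    eps = [rho * g / norm for g in buf]
    w_adv = [x + e for x, e in zip(w, eps)]

    # Multiply g1 by a factor instead of zeroing it out.
    c = alpha / (1.0 - alpha)
    for i in range(len(buf)):
        buf[i] *= c

    # Second gradient g2 at the perturbed weights, accumulated on top.
    for i, g in enumerate(grad_loss(w_adv)):
        buf[i] += g

    # Rescale the sum: (1-alpha) * (c*g1 + g2) = alpha*g1 + (1-alpha)*g2.
    for i in range(len(buf)):
        buf[i] *= 1.0 - alpha

    # Step from the original (unperturbed) weights.
    return [x - lr * g for x, g in zip(w, buf)]


if __name__ == "__main__":
    print(blended_sam_step([1.0, -2.0], alpha=0.3))
```

The only extra work relative to a plain SAM step in this sketch is the two in-place rescalings of the buffer, which is consistent with the "almost no extra compute" remark above.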