I’d like to, but it’ll have to wait until I’ve finished a commercial project where I’m using them, or until I replace these techniques with something else in my code. I’ll post a reply here once I do. I’d expect somebody else to discover at least one of them in the meantime; they’re not some stunning insights.
One of these improvements was just published (https://arxiv.org/abs/2202.03599). Since they were able to publish already, they likely had this idea before me. What I noticed is that in the Sharpness-Aware Minimization paper (ICLR 2021, https://arxiv.org/abs/2010.01412), the first gradient is simply ignored when updating the weights, as can be seen in Figure 2 or in the pseudo-code. But that’s a valuable data point that the optimizer would normally use to update the weights, so why not do the update step using a value in between the two gradients? And it works.
The nice thing is that it’s possible to implement this without increasing the memory requirements, and with almost no extra compute, compared to SAM. You don’t need to store the first gradient separately: just multiply it by some factor, don’t zero out the gradients, let the second gradient accumulate on top of it, and rescale the sum.
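To make the trick concrete, here is a minimal sketch in plain Python rather than my actual code. The names `mixed_sam_step`, `quadratic_grad`, and the mixing weight `alpha` are made up for illustration, and the toy quadratic loss stands in for a real model. It assumes the update direction is `alpha * g1 + (1 - alpha) * g2`, where `g1` is the gradient at the current weights and `g2` is the gradient at the SAM-perturbed weights; with `alpha = 0` it falls back to the plain SAM update.

```python
import math


def quadratic_grad(w):
    """Gradient of the toy loss 0.5 * (w[0]**2 + 3 * w[1]**2) (illustrative only)."""
    return [w[0], 3.0 * w[1]]


def mixed_sam_step(w, grad_fn, lr=0.1, rho=0.05, alpha=0.5):
    """One SAM-style step that updates with alpha*g1 + (1-alpha)*g2.

    A single gradient buffer is reused for both passes, mimicking a
    framework that accumulates gradients until they are zeroed.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")

    grad = [0.0] * len(w)  # the one and only gradient buffer

    # First pass: gradient at the current weights lands in the buffer.
    for i, g in enumerate(grad_fn(w)):
        grad[i] += g

    # SAM perturbation: move a distance rho along the normalized gradient.
    norm = math.sqrt(sum(g * g for g in grad)) + 1e-12
    w_adv = [wi + rho * g / norm for wi, g in zip(w, grad)]

    # Instead of zeroing the buffer, scale the first gradient in place.
    factor = alpha / (1.0 - alpha)
    for i in range(len(grad)):
        grad[i] *= factor

    # Second pass: gradient at the perturbed weights accumulates on top.
    for i, g in enumerate(grad_fn(w_adv)):
        grad[i] += g

    # Rescale the sum so the buffer holds alpha*g1 + (1-alpha)*g2.
    for i in range(len(grad)):
        grad[i] *= 1.0 - alpha

    # Plain gradient-descent update from the original (unperturbed) weights.
    return [wi - lr * g for wi, g in zip(w, grad)]


w_next = mixed_sam_step([1.0, -2.0], quadratic_grad, alpha=0.5)
print(w_next)
```

The extra work over SAM is just the two in-place scalings of the buffer. In a real framework the same pattern would amount to scaling the stored gradients after the first backward pass and skipping the usual gradient reset before the second one.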