Some anecdotal evidence: in the last few months I was able to improve on three 2021 conference-published, peer-reviewed DL papers. In each case, the reason I was able to do it was that the authors did not fully understand why the technique they used worked and obviously just wrote a paper around something that they experimentally found to be working. In addition, there are two pretty obvious bugs in a reasonably popular optimization library (100+ GitHub stars) that reduce performance and haven’t been fixed or noticed in “Issues” for a long time. It seems that none of its users stepped through the code or tried to carefully understand what was going on.
What all four of these have in common is that they are still actually working, just not optimally. Their experimental results are not fake. This does not fill me with hope for the future of interpretability.
Karpathy’s law: “neural nets want to work”. This is another source of capabilities jumps: where the capability ‘existed’, but there was just a bug that crippled it (e.g. R2D2) with a small, often one-liner, fix.
The more you have a self-improving system that feeds back into itself hyperbolically, the more it functions end-to-end and removes the hardwired (human-engineered) parts that Amdahl’s-laws the total output, the more you may go from “pokes around doing nothing much, diverging half the time, beautiful idea, too bad it doesn’t work in the real world” to “FOOM”. (This is also the model of the economy that things like Solow growth models usually lead to: humanity or Europe pokes around doing nothing much discernible, nothing anyone like chimpanzees or the Aztec Empire should worry about, until...)
I’ve experienced this firsthand: I spent days trying to track down disappointing classification accuracy, assuming some bug in my model/math, only to find out later it was actually a bug in a newer custom matrix multiplication routine that my (insufficient) unit tests didn’t cover. It had just never occurred to me that GD could optimize around that.
And on a related note, some big advances (arguably even transformers) are more a case of just getting out of SGD’s way to let it do its thing than of some huge new insight.
Out of curiosity, are you willing to share the papers you improved upon?
I’d like to, but it’ll have to wait until I’m finished with a commercial project where I’m using them, or until I replace these techniques with something else in my code. I’ll post a reply here once I do. I’d expect somebody else to discover at least one of them in the meantime; they’re not some stunning insights.
One of these improvements was just published: https://arxiv.org/abs/2202.03599. Since they were able to publish already, they likely had this idea before me. What I noticed is that in the Sharpness-Aware Minimization paper (ICLR 2021, https://arxiv.org/abs/2010.01412), the first gradient is simply ignored when updating the weights, as can be seen in Figure 2 or in the pseudo-code. But that’s a valuable data point that the optimizer would normally use to update the weights, so why not do the update step using a value in between the two gradients? And it works.
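In update-rule form, a rough sketch of what this looks like (the blending coefficient alpha is my own notation, not from either paper):

```latex
% SAM (Foret et al., ICLR 2021): the first gradient g_1 only defines the
% perturbation; the weight update uses g_2 alone.
g_1 = \nabla L(w), \quad
w_{\mathrm{adv}} = w + \rho\,\frac{g_1}{\lVert g_1 \rVert_2}, \quad
g_2 = \nabla L(w_{\mathrm{adv}}), \quad
w \leftarrow w - \eta\, g_2

% Variant described here: update with a mix of the two gradients instead of
% discarding g_1 (alpha is an illustrative hyperparameter; alpha = 0 recovers SAM).
w \leftarrow w - \eta\,\frac{\alpha\, g_1 + g_2}{1 + \alpha}
```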
The nice thing is that it’s possible to implement this with (almost) no increase in memory requirements or compute compared to SAM: you don’t need to store the first gradient separately; just multiply it by some factor, don’t zero out the gradients, let the second gradient accumulate on top, and rescale the sum.
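A minimal PyTorch-style sketch of that trick, assuming a standard training step; the function and hyperparameter names (rho, alpha) are illustrative, not from either paper. The only extra state is the list of perturbation tensors, which a SAM step needs anyway to undo the ascent step.

```python
import torch

def sam_blended_step(model, loss_fn, inputs, targets, base_optimizer,
                     rho=0.05, alpha=0.3):
    # First pass: g1 at the current weights (assumes grads start zeroed).
    loss_fn(model(inputs), targets).backward()

    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)

    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)  # SAM ascent direction
            p.add_(e)               # move to the perturbed point w + e
            p.grad.mul_(alpha)      # keep alpha * g1 in .grad, no extra copy
            perturbations.append((p, e))

    # Second pass: do NOT zero grads, so .grad becomes alpha * g1 + g2.
    loss_fn(model(inputs), targets).backward()

    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)                    # undo the perturbation
            p.grad.div_(1.0 + alpha)     # rescale the blended gradient

    base_optimizer.step()   # e.g. SGD/Adam on (alpha * g1 + g2) / (1 + alpha)
    base_optimizer.zero_grad()
```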
Yes! Anecdotal confirmation of my previously-held beliefs!