In addition, there are two pretty obvious bugs in a reasonably popular optimization library (100+ GitHub stars) that reduce performance and have sat unfixed, and unreported in its Issues, for a long time.
Karpathy’s law: “neural nets want to work”. This is another source of capabilities jumps: where the capability ‘existed’, but there was just a bug that crippled it (eg R2D2) with a small, often one-liner, fix.
The more you have a self-improving system that feeds back into itself hyperbolically, the more it functions end-to-end and removes the hardwired (human-engineered) parts that Amdahl’s-law the total output, the more likely you are to go from “pokes around doing nothing much, diverging half the time, beautiful idea, too bad it doesn’t work in the real world” to “FOOM”. (This is also the model of the economy that things like Solow growth models usually lead to: humanity, or Europe, pokes around doing nothing much discernible, nothing that anyone, like the chimpanzees or the Aztec Empire, should worry about, until...)
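To make the contrast concrete, here is a minimal numerical sketch (the constants are made up purely for illustration, not taken from any real system): Amdahl’s law caps total output by the hardwired fraction, while a system whose growth rate feeds back on its own level blows up in finite time rather than merely growing exponentially.

```python
import math

def amdahl_speedup(p, s):
    """Total speedup when a fraction p of the work is sped up by factor s
    and the remaining hardwired (1 - p) never improves (Amdahl's law)."""
    return 1.0 / ((1.0 - p) + p / s)

for p in (0.5, 0.9, 0.99):
    # No matter how large s gets, total output is capped at 1 / (1 - p);
    # only p -> 1 (fully end-to-end, nothing hardwired) removes the cap.
    print(f"improvable fraction {p}: cap {1 / (1 - p):.0f}x, at s=1000: {amdahl_speedup(p, 1000):.1f}x")

# Self-improvement feeding back into itself: dx/dt = k*x^2 ("hyperbolic") reaches
# infinity at the finite time t = 1/(k*x0), while dx/dt = k*x (exponential)
# never blows up and takes much longer just to get large.
k, x0 = 0.05, 1.0
print(f"hyperbolic blow-up at t = {1 / (k * x0):.0f}")
print(f"exponential merely reaches 1e9 at t = {math.log(1e9) / k:.0f}")
```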
I’ve experienced this firsthand: I spent days trying to track down disappointing classification accuracy, assuming some bug in my model or math, only to find out later that it was actually a bug in a newer custom matrix-multiplication routine which my (insufficient) unit tests didn’t cover. It had just never occurred to me that GD could optimize around that.
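The routine and the bug aren’t described beyond that, so here is a purely hypothetical reconstruction of the failure mode in numpy: an off-by-one in a hand-rolled matmul quietly drops one input feature, and gradient descent simply learns to lean on the remaining ones, so the only symptom is mediocre accuracy rather than an obvious failure.

```python
import numpy as np

rng = np.random.default_rng(0)

def matmul_correct(a, b):
    return a @ b

def matmul_buggy(a, b):
    """Hand-rolled matmul with a hypothetical off-by-one bug: it silently drops
    the last column of `a`. (Not the actual bug from the comment above.)"""
    out = np.zeros((a.shape[0], b.shape[1]))
    for k in range(a.shape[1] - 1):          # BUG: should be range(a.shape[1])
        out += np.outer(a[:, k], b[k])
    return out

# Toy linearly-separable task where the feature the buggy matmul drops matters.
n, d, h = 4000, 16, 32
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
w_true[-1] = 2.0                              # give real weight to the dropped feature
y = (X @ w_true > 0).astype(float)
P = rng.normal(size=(d, h)) / np.sqrt(d)      # fixed projection "layer" using the routine under test

def train_accuracy(matmul):
    Z = matmul(X, P)                          # forward pass through the custom routine
    w = np.zeros(h)
    for _ in range(3000):                     # plain full-batch gradient descent on logistic loss
        p = 1.0 / (1.0 + np.exp(-(Z @ w)))
        w -= 0.5 * Z.T @ (p - y) / n
    return ((Z @ w > 0) == (y > 0.5)).mean()

print("correct matmul:", train_accuracy(matmul_correct))   # near-perfect
print("buggy matmul:  ", train_accuracy(matmul_buggy))     # noticeably worse, but far above chance
```

The exact numbers are illustrative, but the qualitative behaviour matches the story: the buggy run trains without complaint and just lands at a disappointing accuracy, which is easy to blame on the model or the data instead of the routine.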
And on a related note, some big advances (arguably even transformers) are more a case of just getting out of SGD’s way to let it do its thing rather than some huge new insight.