Karpathy’s law: “neural nets want to work”. This is another source of capability jumps: the capability ‘existed’ all along, but a bug was crippling it.
I’ve experienced this firsthand: I spent days trying to track down disappointing classification accuracy, assuming some bug in my model or math, only to find out later it was actually a bug in a new custom matrix-multiplication routine that my (insufficient) unit tests didn’t cover. It had simply never occurred to me that GD could optimize around that.
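To make the failure mode concrete, here is a minimal PyTorch sketch (hypothetical bug, synthetic data, not the original code): the “custom kernel” silently drops half the input features, yet SGD still trains the net to an accuracy that is merely disappointing rather than obviously broken, which is exactly the regime where you suspect the model or the math before the matmul.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def buggy_matmul(x, w):
    # Hypothetical stand-in for a hand-written matmul kernel with an
    # indexing bug: it silently ignores the second half of the inputs.
    k = x.shape[1] // 2
    return x[:, :k] @ w[:k]  # the correct version would be: x @ w

# Synthetic 4-class task whose labels depend on *all* 20 features.
d_in, n_classes = 20, 4
true_w = torch.randn(d_in, n_classes)
x = torch.randn(4000, d_in)
y = (x @ true_w).argmax(dim=1)

def train(matmul):
    # One-hidden-layer net; the chosen matmul is used for the first layer.
    w1 = (0.1 * torch.randn(d_in, 64)).requires_grad_()
    w2 = (0.1 * torch.randn(64, n_classes)).requires_grad_()
    opt = torch.optim.SGD([w1, w2], lr=0.1)
    for _ in range(2000):
        loss = F.cross_entropy(F.relu(matmul(x, w1)) @ w2, y)
        opt.zero_grad(); loss.backward(); opt.step()
    preds = (F.relu(matmul(x, w1)) @ w2).argmax(dim=1)
    return (preds == y).float().mean()

print(f"correct matmul: {train(torch.matmul):.1%}")
print(f"buggy matmul:   {train(buggy_matmul):.1%}")
# The buggy run typically still lands far above the 25% chance level:
# plausible enough to send you hunting through model and math first.
```

The point of the comparison is that the loss curve for the buggy run looks perfectly healthy; GD just treats the crippled kernel as part of the landscape and optimizes around it.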
And on a related note, some big advances (arguably even transformers) are more a case of just getting out of SGD’s way to let it do its thing rather than some huge new insight.