If singular learning theory turns out to capture a majority of the inductive biases of deep learning, then I don’t think there will be anything “after” deep learning. I think deep learning represents the final paradigm for capable learning systems.
Any learning system worth considering must have a lot of flexible internal representational capacity, which it can adapt to model external reality, as well as some basically local update rule for making the system better at modeling reality. Unless this representational capacity is specifically initialized to respect every possible symmetry in the data it will ever try to understand (which seems ~impossible, and like it would defeat the entire point of having a learning system at all), the internal state of the system’s representations will be underdetermined by the data at hand.
This means the system will have many internal states whose behaviors are ~functionally equivalent under whatever learning rules it uses to adapt to data. At this point, singular learning theory applies, regardless of the specifics of your update rule. You might not have an explicitly defined “loss function”, you might not use something you call “gradient descent” to modify the internals of your system, and you might not even use any matrix math at all, but at its core, your system will basically be doing deep learning. Its inductive biases will mostly come from how the geometry of the data you provide it interacts with the local properties of the update rule, and you’ll see it make generalizations similar to those of a deep learning system (assuming both have learned from a large enough amount of data).
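To make the degeneracy point concrete, here’s a minimal toy sketch (my own illustration, not anything from SLT proper): two different parameter settings of a tiny ReLU network that compute exactly the same function, via a hidden-unit permutation and a rescaling symmetry. This is the kind of “many internal states, same behavior” structure that makes the model singular.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # 3 inputs, 4 hidden ReLU units
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)   # 1 output

def net(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

# Symmetry 1: permuting the hidden units leaves the computed function unchanged.
perm = [2, 0, 3, 1]
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

# Symmetry 2: ReLU is positively homogeneous, so scaling a unit's incoming
# weights by c and its outgoing weight by 1/c also leaves the function unchanged.
c = 3.7
W1s, b1s, W2s = W1.copy(), b1.copy(), W2.copy()
W1s[0] *= c; b1s[0] *= c; W2s[:, 0] /= c

x = rng.normal(size=3)
print(np.allclose(net(x, W1, b1, W2, b2), net(x, W1p, b1p, W2p, b2)))  # True
print(np.allclose(net(x, W1, b1, W2, b2), net(x, W1s, b1s, W2s, b2)))  # True
```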
Whether singular learning theory actually yields you anything useful when your optimiser converges to the largest singularity seems very much architecture-dependent, though? If you fit a 175-billion-degree polynomial to the internet to do next-token prediction, I think you’ll get out nonsense, not GPT-3, because a broad solution in the polynomial loss landscape does not seem to correspond to a Kolmogorov-simple solution to the same degree it does for MLPs or transformers.
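As a crude illustration of the inductive-bias mismatch (again my own toy example with arbitrary numbers, not a claim about SLT’s basin-volume story): a near-interpolating high-degree polynomial fit to samples of a very simple target tends to oscillate wildly between the sample points, rather than recovering the simple rule.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, size=20))
y_train = np.sign(x_train)                      # a step function: a very simple target

# Degree-19 least-squares fit on 20 points, i.e. the (near-)interpolating regime.
poly = np.polynomial.Polynomial.fit(x_train, y_train, deg=19)

x_test = np.linspace(-1, 1, 9)
print(np.round(poly(x_test), 2))                # typically swings far outside [-1, 1]
print(np.sign(x_test))                          # what the "simple" generalization would give
```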
Likewise, there doesn’t seem to be anything saying you can’t have an architecture with an even better simplicity and speed bias than the MLP family has.
This seems correct to me.