Yeah, I can expand on that—this is obviously going to be fairly opinionated, but there are a few things I’m excited about in this direction.
The first thing that comes to mind here is singular learning theory. I think all of my thoughts on DL theory are fairly strongly influenced by it at this point. It definitely doesn’t have all the answers at the moment, but it’s the single largest theory I’ve found that makes deep learning phenomena substantially “less surprising” (bonus points for these ideas preceding deep learning). For instance, one of the first things that SLT tells you is that the effective parameter count (RLCT) of your model can vary depending on the training distribution, allowing it to basically do internal model selection—the absence of a bias-variance tradeoff, and the success of overparameterized models, aren’t surprising when you internalize this. The “connection to real world structure” aspect hasn’t been fully developed here, but it seems heavily suggested by the framework, in multiple ways—for instance, hierarchical statistical models are naturally singular statistical models, and the hierarchical structure is reflected in the singularities. (See also Tom Waring’s thesis.)
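To make the “effective parameter count can be lower than the nominal one” point concrete, here’s a minimal toy sketch (my own illustrative example, not taken from the SLT literature): the regression model f(x) = a·b·x has two parameters but only one effective direction, so its Fisher information matrix is rank-deficient everywhere, and degenerates completely at the singularity a = b = 0.

```python
import numpy as np

def fisher_information(a, b, xs):
    """Fisher information for the toy regression model f(x) = a*b*x
    with unit Gaussian noise. The gradient of f wrt (a, b) is (b*x, a*x),
    so I(theta) is the average outer product of these gradients."""
    grads = np.array([[b * x, a * x] for x in xs])
    return grads.T @ grads / len(xs)

xs = np.linspace(-1.0, 1.0, 101)

# At a generic parameter point the Fisher matrix has rank 1, not 2:
# the model only "sees" the product c = a*b, so one nominal parameter
# is redundant.
I_generic = fisher_information(1.0, 1.0, xs)

# At the singular point a = b = 0 the gradient vanishes identically
# and the Fisher matrix drops to rank 0.
I_singular = fisher_information(0.0, 0.0, xs)

print(np.linalg.matrix_rank(I_generic))   # 1
print(np.linalg.matrix_rank(I_singular))  # 0
```

Classical asymptotics (BIC, Fisher-based Laplace approximations) assume this matrix is nonsingular; models like this one, and neural networks generally, violate that assumption, which is exactly the regime SLT handles.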
Outside of SLT, there are a few other areas I’m excited about—I’ll highlight just one. You mentioned Lin, Tegmark, and Rolnick—the broader literature on depth separations and the curse of dimensionality seems quite important. The approximation abilities of NNs are usually glossed over with universal approximation arguments, but this can’t be enough—for generic Lipschitz functions, universal approximation takes exponentially many parameters in the input dimension (this is a provable lower bound). So there has to be something special about the functions we care about in the real world. See this section of my post for more information. I’d highlight Poggio et al. here, which is the paper in the literature closest to my current view on this.
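The exponential scaling is easy to see in the simplest approximation scheme: to approximate an arbitrary 1-Lipschitz function on [0,1]^d to accuracy ~ε, cover the cube with a grid of cells of side ε, and any approximator needs roughly one degree of freedom per cell. A back-of-envelope sketch (the grid-covering count, not the NN-specific lower bound itself):

```python
import math

def grid_cells(eps, d):
    """Number of grid cells of side eps needed to cover [0,1]^d.
    Within each cell a 1-Lipschitz function varies by at most ~eps,
    so a piecewise-constant approximator needs this many parameters."""
    return math.ceil(1.0 / eps) ** d

# eps = 0.1: 10 cells in 1D, but 10^10 in 10D and 10^100 in 100D.
for d in (1, 10, 100):
    print(d, grid_cells(0.1, d))
```

The fact that real-world functions are learnable with far fewer parameters than this is exactly the puzzle: natural functions must have structure (compositionality, hierarchy, low-dimensional latent variables) that generic Lipschitz functions lack.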
This isn’t a complete list, even of theoretical areas that I think could specifically help address the “real world structure” connection, but these are the two I’d feel bad not mentioning. This doesn’t include some of the more empirical findings in science of DL that I think are relevant, like simplicity bias, mode connectivity, grokking, etc. Or work outside DL that could be helpful to draw on, like Boolean circuit complexity, algorithmic information theory, natural abstractions, etc.
FWIW most potential theories of deep learning are able to explain these, I don’t think this distinguishes SLT particularly much.
Agreed—that alone isn’t particularly much, just one of the easier things to express succinctly. (Though the fact that this predates deep learning does seem significant to me. And the fact that SLT can delineate precisely where statistical learning theory went wrong here seems important too.)
Another is that it can explain phenomena like phase transitions, as observed in e.g. toy models of superposition, at a quantitative level. There’s also been a substantial chunk of non-SLT ML literature that has independently rediscovered small pieces of SLT, like failures of information geometry, importance of parameter degeneracies, etc. More speculatively, but what excites me most, is that empirical phenomena like grokking, mode connectivity, and circuits seem to fit naturally into SLT, though this hasn’t been demonstrated rigorously yet.
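For concreteness on the “quantitative” claim: SLT’s free energy asymptotics, F_n ≈ n·L + λ·log n, replace BIC’s d/2 coefficient with the learning coefficient λ (which can be much smaller than d/2), and a phase transition shows up as a crossover in which minimum of the free energy dominates as the sample size n grows. A hedged numerical sketch with made-up illustrative numbers (not values from the superposition paper):

```python
import math

def free_energy(n, train_loss, lam):
    """SLT asymptotic free energy of a phase: F_n ~ n * L + lambda * log n,
    where L is the phase's training loss and lambda its learning coefficient."""
    return n * train_loss + lam * math.log(n)

# Two candidate "phases" of the posterior: a simple solution
# (higher loss, small lambda) and a complex one (lower loss, larger lambda).
simple_loss, simple_lam = 0.10, 1.0
complex_loss, complex_lam = 0.05, 5.0

# At small n the simple phase has lower free energy; at large n the
# complex phase takes over -- a phase transition as a function of n.
for n in (10, 1000):
    print(n,
          free_energy(n, simple_loss, simple_lam),
          free_energy(n, complex_loss, complex_lam))
```

The crossover point is where n·(0.10 − 0.05) = 4·log n, i.e. the extra fit of the complex phase starts to pay for its extra complexity; this is the mechanism by which SLT predicts sharp transitions between qualitatively different solutions during learning.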