I think probably the most important thing to understand about neural networks is their inductive bias and generalisation behaviour, at a fine-grained level, and I don’t think SLT can tell you very much about that. I assume that our disagreement must be about one of those two claims?
That seems probable. Maybe it’s useful for me to lay out a more or less complete picture of what I think SLT does say about generalisation in deep learning in its current form, so that we’re on the same page. When people refer to the “generalisation puzzle” in deep learning I think they mean two related but distinct things:
(i) the general question of how it is possible for overparametrised models to have good generalisation error, despite classical interpretations of Occam’s razor like the BIC;
(ii) the specific question of why neural networks, among all possible overparametrised models, actually have good generalisation error in practice (saying this is possible is much weaker than actually explaining why it happens).
In my mind SLT comes close to resolving (i), modulo a bunch of open questions, which include: whether the asymptotic limit taking the dataset size to infinity is appropriate in practice, the relationship between Bayesian generalisation error and test error in the ML sense (which comes down largely to the Bayesian posterior vs SGD), and whether hypotheses like relative finite variance are appropriate in the settings we care about. If all those points were treated in a mathematically satisfactory way, I would feel that the general question is completely resolved by SLT.
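Roughly, the asymptotic fact underlying this (eliding the multiplicity and lower-order terms) is the free energy expansion

$$
F_n \;=\; -\log \int \varphi(w)\, e^{-n L_n(w)}\, dw \;\approx\; n L_n(w_0) + \lambda \log n, \qquad \lambda \le \frac{d}{2},
$$

where L_n(w) is the empirical negative log likelihood, w_0 its minimiser, φ the prior, and d the number of parameters. The BIC is the regular case λ = d/2; for singular models λ can be far smaller, so the “penalty for having many parameters” that makes (i) look paradoxical is simply not the right penalty.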
Informally, knowing SLT dispels the mystery of (i) sufficiently that I don’t feel personally motivated to resolve all these points, although I hope people work on them. One technical note on this: there are some brief notes in SLT6 arguing that “test error” as a model selection principle in ML, presuming some relation between the Bayesian posterior and SGD, is similar to selecting models based on what Watanabe calls the Gibbs generalisation error, which is determined by both the RLCT and the singular fluctuation. Since I don’t think it’s crucial to our discussion I’ll just elide the difference between the Gibbs generalisation error in the Bayesian framework and test error in ML, but we can return to that if it actually contains important disagreement.
Anyway I’m guessing you’re probably willing to grant (i), based on SLT or your own views, and would agree the real bone of contention lies with (ii).
Any theoretical resolution to (ii) has to involve some nontrivial ingredient that actually talks about neural networks, as opposed to general singular statistical models. The only specific results about neural networks and generalisation in SLT are the old results about RLCTs of tanh networks, more recent bounds on shallow ReLU networks, and Aoyagi’s upcoming results on RLCTs of deep linear networks (particularly that the RLCT is bounded above even when you take the depth to infinity).
As I currently understand them, these results are far from resolving (ii). In its current form SLT doesn’t supply any deep reason for why neural networks in particular are often observed to generalise well when you train them on a range of what we consider “natural” datasets. We don’t understand what distinguishes neural networks from generic singular models, nor what we mean by “natural”. These seem like hard problems, and at present it looks like one has to tackle them in some form to really answer (ii).
Maybe that has significant overlap with the critique of SLT you’re making?
Nonetheless I think SLT reduces the problem in a way that seems nontrivial. If we boil the “ML in-practice model selection” story down to “choose the model with the best test error given a fixed number of training steps”, allow some hand-waving in the connection between training steps and number of samples, between the Gibbs generalisation error and test error, etc., and use Watanabe’s theorems (see Appendix B.1 of the quantifying degeneracy paper for a local formulation) to write the Gibbs generalisation error as
$$
G_g(n) = L_0 + \frac{1}{n}(\lambda + \nu)
$$
where λ is the learning coefficient, ν is the singular fluctuation, and L_0 is roughly the loss (the quantity we can actually estimate from samples is slightly different; I’ll elide this), then (ii), which asks why neural networks on natural datasets have low generalisation error, is at least reduced to the question of why neural networks on natural datasets have low L_0, λ and ν.
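To make the bookkeeping concrete, here is a toy sketch of how this reduction reads as a selection rule; the model names and numbers below are made-up placeholders, not estimates for any real model:

```python
# Toy illustration (made-up numbers) of comparing models by the leading-order
# Gibbs generalisation error  G_g(n) = L_0 + (lambda + nu) / n.

def gibbs_gen_error(L0: float, lam: float, nu: float, n: int) -> float:
    """Leading-order Gibbs generalisation error at sample size n."""
    return L0 + (lam + nu) / n

# Hypothetical estimates (L_0, lambda, nu).  For a regular model with d
# parameters one would have lambda = d / 2; a more degenerate model can have
# a much smaller learning coefficient.
candidates = {
    "degenerate_model": (0.30, 120.0, 40.0),
    "near_regular_model": (0.28, 900.0, 150.0),  # e.g. d = 1800, lambda = d/2
}

n = 10_000
for name, (L0, lam, nu) in candidates.items():
    print(f"{name}: G_g({n}) = {gibbs_gen_error(L0, lam, nu, n):.4f}")

best = min(candidates, key=lambda k: gibbs_gen_error(*candidates[k], n))
print("selected:", best)
```

The only point of the sketch is that once L_0, λ and ν are on the table, comparing models comes down to comparing three numbers, which is why the question shifts to why those numbers are small for neural networks on natural data.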
I don’t know much about this question, and agree it is important and outstanding.
Again, I think this reduction is not trivial since the link between λ, ν and generalisation error is nontrivial. Maybe at the end of the day this is the main thing we in fact disagree on :)
Anyway I’m guessing you’re probably willing to grant (i), based on SLT or your own views, and would agree the real bone of contention lies with (ii).
Yes, absolutely. However, I also don’t think that (i) is very mysterious, if we view things from a Bayesian perspective. Indeed, it seems natural to say that an ideal Bayesian reasoner should assign non-zero prior probability to all computable models, or something along those lines, and in that case, notions like “overparameterised” no longer seem very significant.
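One standard way of making that precise is the dominance bound for Bayes mixtures (stated here only as a sketch): if the reasoner’s predictive distribution is $M(x_{1:n}) = \sum_h p(h)\, P(x_{1:n} \mid h)$ over a countable class of hypotheses, then for every h with p(h) > 0,

$$
-\log M(x_{1:n}) \;\le\; -\log P(x_{1:n} \mid h) \;+\; \log \frac{1}{p(h)},
$$

so the cumulative log loss of the mixture is within a constant, independent of n, of any hypothesis it assigns positive prior to. Parameter counts only matter to the extent that they show up in the prior weights.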
Maybe that has significant overlap with the critique of SLT you’re making?
Yes, this is basically exactly what my criticism of SLT is—I could not have described it better myself!
Again, I think this reduction is not trivial since the link between λ, ν and generalisation error is nontrivial.
I agree that this reduction is relevant and non-trivial. I don’t have any objections to this per se. However, I do think that there is another angle of attack on this problem that (to me) seems to get us much closer to a solution (namely, to investigate the properties of the parameter-function map).
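One concrete version of that kind of investigation, as a sketch only (the architecture, input space and weight distribution below are arbitrary illustrative choices): sample parameters at random and look at the distribution over the functions they induce.

```python
# Minimal empirical probe of the parameter-function map: sample random
# parameters of a tiny ReLU network on all Boolean inputs of length 5 and
# count how often each induced Boolean function appears.
import numpy as np
from collections import Counter
from itertools import product

rng = np.random.default_rng(0)
X = np.array(list(product([0.0, 1.0], repeat=5)))   # all 32 Boolean inputs

def sample_function(hidden: int = 16):
    W1 = rng.normal(size=(5, hidden)); b1 = rng.normal(size=hidden)
    W2 = rng.normal(size=hidden);      b2 = rng.normal()
    h = np.maximum(X @ W1 + b1, 0.0)                 # ReLU layer
    out = h @ W2 + b2
    return tuple((out > 0).astype(int))              # induced Boolean function

counts = Counter(sample_function() for _ in range(100_000))
print("distinct functions hit:", len(counts))
print("top 5 frequencies:", [c for _, c in counts.most_common(5)])
# A heavily skewed histogram here is the kind of inductive-bias property
# of the parameter-function map being referred to.
```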
However, I do think that there is another angle of attack on this problem that (to me) seems to get us much closer to a solution (namely, to investigate the properties of the parameter-function map)
Seems reasonable to me!