For example, the agnostic PAC-learning theorem says that if a learning machine L (for binary classification) is an empirical risk minimiser with VC dimension d, then for any distribution D over X×{0,1}, if L is given access to at least Ω((d/ϵ²) + (d/ϵ²)log(1/δ)) data points sampled from D, then it will with probability at least 1−δ learn a function whose (true) generalisation error (under D) is at most ϵ worse than the best function which L is able to express (in terms of its true generalisation error under D). If we assume that D corresponds to a function which L can express, then the generalisation error of L will with probability at least 1−δ be at most ϵ.
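To make the quantities concrete, here is a minimal numeric sketch of that sample bound. The Ω hides an unknown constant factor, so the c = 1.0 below (and the helper name agnostic_pac_sample_bound) are purely illustrative assumptions, not part of the theorem:

```python
import math

def agnostic_pac_sample_bound(d, epsilon, delta, c=1.0):
    """Number of samples sufficient in the agnostic PAC bound above.

    Hypothetical helper: the theorem only fixes the bound up to a
    constant factor, so c = 1.0 is an arbitrary illustrative choice.
    """
    return math.ceil(c * ((d / epsilon**2) + (d / epsilon**2) * math.log(1 / delta)))

# e.g. VC dimension 10, excess error at most 0.05, failure probability 0.01:
print(agnostic_pac_sample_bound(d=10, epsilon=0.05, delta=0.01))  # 22421
```

Even for a modest hypothesis class (d = 10) and fairly loose targets, the bound asks for tens of thousands of samples, which is one way of seeing the looseness discussed below.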
This means that, in the limit of infinite data, L will with probability arbitrarily close to 1 learn a function whose error is arbitrarily close to the optimal value (among all functions which L is able to express). Thus, any empirical risk minimiser with finite VC dimension will generalise well in the limit of infinite data.
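Solving the stated bound for ϵ makes the limit explicit: for fixed d and δ, the guaranteed excess error shrinks like √(d/m), so it goes to 0 as the number of samples m goes to infinity. A sketch under the same arbitrary-constant assumption:

```python
import math

def excess_error_bound(m, d, delta, c=1.0):
    """Invert the sample bound: the smallest excess error epsilon it
    guarantees given m samples, VC dimension d, and failure probability
    delta. Again only meaningful up to the unknown constant c."""
    return math.sqrt(c * d * (1 + math.log(1 / delta)) / m)

# The guaranteed excess error shrinks like 1/sqrt(m) (d=10, delta=0.01):
for m in [10**3, 10**5, 10**7]:
    print(m, round(excess_error_bound(m, d=10, delta=0.01), 4))
# 1000     0.2368
# 100000   0.0237
# 10000000 0.0024
```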
The linked post you wrote about classical learning theory states that the bounds PAC learning gives are far looser than what we see in practice for neural networks.
In the post, you sketch some directions in which tighter bounds might be proven. It is my understanding that these directions have not been pursued further.
Given all that, “fully adequate account of generalization” seems like an overstatement, wouldn’t you agree?
At best, as far as I can tell, we can say that PAC learning gives a nice toy model for thinking about notions like generalization and learnability. Maybe I’m wrong (I’m not familiar with the literature) and I’d love to know more about what PAC & classical learning theory can tell us about neural networks.
I think that it gives us an adequate account of generalisation in the limit of infinite data (or, more specifically, in the case where we have enough data to wash out the influence of the inductive bias); this is what my original remark was about. I don’t think classical statistical learning theory gives us an adequate account of generalisation in the setting where the training data is small enough for our inductive bias to still matter, and it only has very limited things to say about out-of-distribution generalisation.
I see, thanks!