I have often said that SLT is not yet a theory of deep learning; the question of whether the infinite-data limit is really the right one is among the main question marks I currently see.
Yes, I agree with this. I think my main objections are (1) the fact that it mostly abstracts away from the parameter-function map, and (2) the infinite-data limit.
My view is that the validity of asymptotics is an empirical question, not something that is settled at the blackboard.
I largely agree, though it depends somewhat on what your aims are. My point there was mainly that theorems about generalisation in the infinite-data limit are likely to end up being weaker versions of more general results from statistical and computational learning theory.
“My point there was mainly that theorems about generalisation in the infinite-data limit are likely to end up being weaker versions of more general results from statistical and computational learning theory.”
What general results from statistical and computational learning theory are you referring to here exactly?
I already posted this in response to Daniel Murfet, but I will copy it over here:
For example, the agnostic PAC-learning theorem says that if a learning machine L (for binary classification) is an empirical risk minimiser with VC dimension d, then for any distribution D over X×{0,1}, if L is given access to at least Ω((d/ϵ²)+(d/ϵ²)log(1/δ)) data points sampled from D, then it will with probability at least 1−δ learn a function whose (true) generalisation error (under D) is at most ϵ worse than the best function which L is able to express (in terms of its true generalisation error under D). If we assume that D corresponds to a function which L can express, then the generalisation error of L will with probability at least 1−δ be at most ϵ.
This means that, in the limit of infinite data, L will with probability arbitrarily close to 1 learn a function whose error is arbitrarily close to the optimal value (among all functions which L is able to express). Thus, any empirical risk minimiser with a finite VC-dimension will generalise well in the limit of infinite data.
For a bit more detail, see this post.
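As a rough illustration of the finite-sample statement above, here is a minimal sketch of an empirical risk minimiser over a toy hypothesis class with VC dimension 1: threshold classifiers h_t(x) = 1[x ≥ t] on [0,1]. Everything in it is an assumption made up for illustration, not something from the discussion: the function names (`sample`, `erm_threshold`, `true_error`, `mean_excess_risk`), the distribution D (x uniform on [0,1], labels from a threshold at 0.3, flipped with probability 0.1), and the sample sizes. It only shows the qualitative behaviour the theorem predicts, namely that the gap between the learned function's true error and the best achievable true error shrinks as the number of data points grows.

```python
import random

def sample(n, t_star=0.3, noise=0.1, rng=random):
    """Draw n points from a toy D: x ~ U[0,1), y = 1[x >= t_star], flipped w.p. noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x >= t_star)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def erm_threshold(data):
    """Return a threshold t minimising the training error of h_t(x) = 1[x >= t]."""
    data = sorted(data)
    n = len(data)
    zeros_total = sum(1 for _, y in data if y == 0)
    # Cut k: the first k sorted points are predicted 0, the rest are predicted 1.
    # errors(k) = (#ones among the first k) + (#zeros among the rest).
    best_k, best_err = 0, zeros_total  # k = 0 predicts 1 everywhere
    ones_seen = zeros_seen = 0
    for k in range(1, n + 1):
        y = data[k - 1][1]
        ones_seen += y
        zeros_seen += 1 - y
        err = ones_seen + (zeros_total - zeros_seen)
        if err < best_err:
            best_k, best_err = k, err
    if best_k == 0:
        return 0.0
    if best_k == n:
        return 1.0  # every sampled x lies in [0,1), so this predicts 0 on all of them
    return data[best_k][0]

def true_error(t, t_star=0.3, noise=0.1):
    """Exact error of h_t under the toy D, for t in [0,1]."""
    # h_t disagrees with the noiseless labelling on a region of mass |t - t_star|.
    return noise + (1 - 2 * noise) * abs(t - t_star)

def mean_excess_risk(n, trials=200, seed=0):
    """Average of (true error of the ERM output) - (best true error in the class)."""
    rng = random.Random(seed)
    best = true_error(0.3)
    total = 0.0
    for _ in range(trials):
        t_hat = erm_threshold(sample(n, rng=rng))
        total += true_error(t_hat) - best
    return total / trials

if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, round(mean_excess_risk(n), 4))
```

Because labels are noisy, the best threshold still has true error 0.1, so this is the agnostic setting: the quantity that goes to zero is the excess over the best function in the class, not the error itself.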
The impression I got was that SLT is trying to show why (transformers + SGD) behaves anything like an empirical risk minimiser in the first place. Might be wrong though.
To say that neural networks are empirical risk minimisers is just to say that they find functions with globally optimal training loss (and, if they find functions with a loss close to the global optimum, then they are approximate empirical risk minimisers, etc).
I think SLT effectively assumes that neural networks are (close to being) empirical risk minimisers, via the assumption that they are trained by Bayesian induction.