First of all, SLT is largely based on examining the behaviour of learning machines in the limit of infinite data
I have often said that SLT is not yet a theory of deep learning, this question of whether the infinite data limit is really the right one being one of the main question marks I currently see (I think I probably also see the gap between Bayesian learning and SGD as bigger than you do).
I’ve discussed this a bit with my colleague Liam Hodgkinson, whose recent papers https://arxiv.org/abs/2307.07785 and https://arxiv.org/abs/2311.07013 might be more up your alley than SLT.
My view is that the validity of asymptotics is an empirical question, not something that is settled at the blackboard. So far we have been pleasantly surprised at how well the free energy formula works at relatively low n (see e.g. https://arxiv.org/abs/2310.06301). It remains an open question whether this asymptotic continues to provide useful insight into larger models with the kind of dataset sizes we’re using in LLMs, for example.
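To make concrete what checking the free energy formula involves, here is a minimal numerical sketch (my own toy example, not from the paper): for a one-parameter model whose population loss behaves like K(w) = w⁴ near the optimum, the Bayes free energy F_n grows like λ log n with learning coefficient λ = 1/4, instead of the d/2 = 1/2 slope that the regular (BIC-style) asymptotic would predict.

```python
import numpy as np

def free_energy(n, K):
    # F_n = -log Z_n, where Z_n = ∫ exp(-n K(w)) dw with a uniform prior on [-1, 1].
    w = np.linspace(-1.0, 1.0, 400_001)
    dw = w[1] - w[0]
    return -np.log(np.sum(np.exp(-n * K(w))) * dw)

def log_n_slope(K, n1=10_000, n2=1_000_000):
    # Coefficient of log n in F_n, i.e. a numerical estimate of the
    # learning coefficient lambda.
    return (free_energy(n2, K) - free_energy(n1, K)) / (np.log(n2) - np.log(n1))

print(log_n_slope(lambda w: w**2))  # regular direction: slope ≈ d/2 = 0.5
print(log_n_slope(lambda w: w**4))  # singular direction: slope ≈ λ = 0.25
```

The point is just that the coefficient of log n is governed by the geometry of the loss near its minimum, which is what the free energy formula captures and what one can test empirically at finite n.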
Yes, I agree with this. I think my main objections are (1) the fact that it mostly abstracts away from the parameter-function map, and (2) the infinite-data limit.
I largely agree, though it depends somewhat on what your aims are. My point there was mainly that theorems about generalisation in the infinite-data limit are likely to end up being weaker versions of more general results from statistical and computational learning theory.
“My point there was mainly that theorems about generalisation in the infinite-data limit are likely to end up being weaker versions of more general results from statistical and computational learning theory.”
What general results from statistical and computational learning theory are you referring to here exactly?
I already posted this in response to Daniel Murfet, but I will copy it over here:
For example, the agnostic PAC-learning theorem says that if a learning machine L (for binary classification) is an empirical risk minimiser with VC dimension d, then for any distribution D over X×{0,1}, if L is given access to at least Ω((d/ϵ²) + (d/ϵ²)log(1/δ)) data points sampled from D, then it will with probability at least 1−δ learn a function whose (true) generalisation error (under D) is at most ϵ worse than the best function which L is able to express (in terms of its true generalisation error under D). If we assume that D corresponds to a function which L can express, then the generalisation error of L will with probability at least 1−δ be at most ϵ.
This means that, in the limit of infinite data, L will with probability arbitrarily close to 1 learn a function whose error is arbitrarily close to the optimal value (among all functions which L is able to express). Thus, any empirical risk minimiser with a finite VC-dimension will generalise well in the limit of infinite data.
For a bit more detail, see this post.
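To illustrate the theorem numerically (a toy sketch of mine, not from that post): threshold classifiers on [0,1] have VC dimension 1, so in the realisable case the true error of an empirical risk minimiser should shrink towards 0 as n grows. For 1-D thresholds, exact ERM is easy to compute:

```python
import numpy as np

rng = np.random.default_rng(0)
t_true = 0.3  # target concept: y = 1 iff x > t_true

def erm_threshold(X, y):
    # Exact ERM for 1-D thresholds in the realisable case: any cut between
    # the largest negative example and the smallest positive example attains
    # zero training error; take the midpoint.
    lo = X[y == 0].max() if (y == 0).any() else 0.0
    hi = X[y == 1].min() if (y == 1).any() else 1.0
    return (lo + hi) / 2

errors = {}
for n in [10, 100, 1000, 10000]:
    X = rng.uniform(size=n)
    y = (X > t_true).astype(int)
    # Under the uniform distribution, the true generalisation error of the
    # learned threshold is exactly its distance from t_true.
    errors[n] = abs(erm_threshold(X, y) - t_true)

print(errors)  # the true error shrinks as n grows
```

This is of course just the easiest possible instance of the theorem, but it shows the sense in which finite VC dimension plus empirical risk minimisation already guarantees good generalisation at large n.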
The impression I got was that SLT is trying to show why (transformers + SGD) behaves anything like an empirical risk minimiser in the first place. Might be wrong though.
To say that neural networks are empirical risk minimisers is just to say that they find functions with globally optimal training loss (and, if they find functions with a loss close to the global optimum, then they are approximate empirical risk minimisers, etc).
I think SLT effectively assumes that neural networks are (close to being) empirical risk minimisers, via the assumption that they are trained by Bayesian induction.
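In code, the definition in question is just the following (a minimal sketch, with a toy hypothesis class of my own choosing):

```python
import numpy as np

def empirical_risk(h, X, y):
    # Average 0-1 training loss of hypothesis h.
    return float(np.mean(h(X) != y))

def erm(hypotheses, X, y):
    # An empirical risk minimiser returns some hypothesis with globally
    # minimal training loss over the class.
    return min(hypotheses, key=lambda h: empirical_risk(h, X, y))

# Toy class: 101 threshold classifiers h_t(x) = 1[x > t].
hypotheses = [lambda x, t=t: (x > t).astype(int) for t in np.linspace(0, 1, 101)]

rng = np.random.default_rng(1)
X = rng.uniform(size=500)
y = (X > 0.3).astype(int)  # realisable target, so ERM attains training loss 0

best = erm(hypotheses, X, y)
print(empirical_risk(best, X, y))
```

The claim that trained networks behave like (approximate) empirical risk minimisers is then the claim that training reliably finds weights whose training loss is at (or close to) this minimum over everything the architecture can express.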