Did I just say SLT is the Newtonian gravity of deep learning? Hubris of the highest order!
But also yes… I think I am saying that
Singular Learning Theory is the first highly accurate model of the breadth of optima.
SLT tells us to look at a quantity Watanabe calls λ, which has the highly technical name ‘real log canonical threshold’ (RLCT). He proves several equivalent ways to describe it, one of which is as the (fractal) volume-scaling dimension around the optima.
By computing simple examples (see Shaowei’s guide in the links below) you can check for yourself how the RLCT picks up on basin broadness.
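To make ‘volume scaling dimension’ concrete, here is a minimal Monte Carlo sketch (my own toy illustration, not taken from Shaowei’s guide): it estimates the exponent λ in Vol{w : L(w) < ε} ∝ ε^λ for a regular quadratic loss versus a singular loss L(w) = w1²·w2². The singular loss comes out with a smaller exponent (a bit below 1/2 at these scales, because of a log correction), i.e. a broader basin.

```python
import numpy as np

rng = np.random.default_rng(0)

def volume_scaling_exponent(loss, eps_values, n_samples=2_000_000, box=1.0):
    """Estimate lambda in Vol{w : loss(w) < eps} ~ eps^lambda by uniform sampling
    in [-box, box]^2 and fitting the slope of log-volume against log-eps."""
    w = rng.uniform(-box, box, size=(n_samples, 2))
    losses = loss(w)
    fractions = np.array([(losses < eps).mean() for eps in eps_values])
    slope, _ = np.polyfit(np.log(eps_values), np.log(fractions), 1)
    return slope

eps_values = np.geomspace(1e-4, 1e-2, 8)

regular  = lambda w: w[:, 0] ** 2 + w[:, 1] ** 2       # regular minimum: exponent d/2 = 1
singular = lambda w: (w[:, 0] ** 2) * (w[:, 1] ** 2)   # singular minimum: exponent 1/2 (times a log factor)

print("regular  w1^2 + w2^2 :", volume_scaling_exponent(regular, eps_values))
print("singular w1^2 * w2^2 :", volume_scaling_exponent(singular, eps_values))
```

The RLCT is (up to the log-multiplicity correction) exactly this scaling exponent, which is why a smaller λ corresponds to a broader basin.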
The RLCT λ gives the first-order term for in-distribution generalization error and also for Bayesian learning (technically, for the ‘Bayesian free energy’). This justifies the name ‘learning coefficient’ for λ. I emphasize that these are mathematically precise statements with complete proofs, not conjectures or intuitions.
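For reference, the two asymptotic statements being appealed to are roughly (n samples, multiplicity m, under Watanabe's conditions; see the book for the fine print):

$$
F_n \;=\; nS_n \;+\; \lambda \log n \;-\; (m-1)\log\log n \;+\; O_p(1),
\qquad
\mathbb{E}\!\left[G_n\right] \;=\; \frac{\lambda}{n} \;+\; o\!\left(\tfrac{1}{n}\right),
$$

where F_n is the Bayesian free energy (negative log marginal likelihood), S_n the empirical entropy of the true distribution, and G_n the Bayes generalization error.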
Knowing a little SLT will inoculate you against many wrong theories of deep learning that abound in the literature. I won’t be going into it here, but suffice to say that any paper assuming that the Fisher information metric is regular for deep neural networks, or for any kind of hierarchical structure, is fundamentally flawed. And you can be sure this assumption is sneaked in all over the place. For instance, this is almost always the case when people talk about the Laplace approximation.
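A minimal example of the kind of degeneracy being pointed at (my own illustration, not from the post): for the toy ‘two-layer’ model y = a·b·x + Gaussian noise, the Fisher information matrix in (a, b) is

$$
I(a,b) \;\propto\; \begin{pmatrix} b^{2} & ab \\ ab & a^{2} \end{pmatrix},
\qquad \det I(a,b) = 0 \ \ \text{for every } (a,b),
$$

so the Fisher metric is singular everywhere, and any Laplace/BIC-style argument that quietly assumes it is invertible does not apply. Deep networks inherit this kind of degeneracy in a much richer form.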
The RLCT λ gives the first-order term for in-distribution generalization error
Clarification: the ‘derivation’ of how the RLCT predicts generalization error, IIRC, goes through the same flavour of argument as the derivation of the vanilla Bayesian Information Criterion. I don’t like this derivation very much. See e.g. this one on Wikipedia.
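For orientation, the Laplace-approximation step behind the BIC that this derivation leans on is roughly (d parameters, n data points, θ̂ the maximum-likelihood fit, regular Fisher information assumed):

$$
\log p(X\mid M) \;=\; \log \int p(X\mid \theta, M)\, p(\theta\mid M)\, d\theta
\;\approx\; \log p(X\mid \hat{\theta}, M) \;-\; \frac{d}{2}\log n \;+\; O(1),
$$

and SLT's free-energy formula replaces the d/2 here with the RLCT λ (plus a log-log correction) when that regularity assumption fails.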
So what it’s actually showing is just that:
If you’ve got a class of different hypotheses M, containing many individual hypotheses {θ_1, θ_2, …, θ_N}.
And you’ve got a prior ahead of time that says the chance that any one of the hypotheses in M is true is some number p(M) < 1; let’s say p(M) = 0.8 as an example.
And you distribute this total probability p(M) = 0.8 around the different hypotheses in an even-ish way, so p(θ_i, M) ∝ 1/N, roughly.
And then you encounter a bunch of data X (the training data) and find that only one or a tiny handful of hypotheses in M fit that data, so p(X|θ_i, M) ≠ 0 for basically only one hypothesis θ_i…
Then your posterior probability that the hypothesis θ_i is correct will probably be tiny, scaling with 1/N. Since only θ_i survives, p(X|M) ≈ (1/N)·p(X|θ_i, M), so the posterior p(M|X) = 0.8·p(X|M) / (0.8·p(X|M) + 0.2·p(X|¬M)) is small. If you spread your prior p(M) = 0.8 over lots of hypotheses, there isn’t a whole lot of prior to go around for any single hypothesis. So if you then encounter data that discredits all hypotheses in M except one, that tiny bit of spread-out prior for that one hypothesis will make up a tiny fraction of the posterior, unless p(X|¬M) is really small, i.e. no hypothesis outside the set M can explain the data either.
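Written out compactly, the scaling in the argument above is (the ≈ steps use the even spread p(θ_i|M) ≈ 1/N and the assumption that only θ_{i*} fits the data):

$$
p(X\mid M) \;=\; \sum_{i=1}^{N} p(\theta_i\mid M)\, p(X\mid \theta_i, M) \;\approx\; \frac{1}{N}\, p(X\mid \theta_{i^*}, M),
\qquad
p(\theta_{i^*}\mid X) \;\approx\; \frac{0.8\, p(X\mid M)}{0.8\, p(X\mid M) + 0.2\, p(X\mid \neg M)} \;\xrightarrow{\;N\to\infty\;}\; 0,
$$

unless p(X|¬M) shrinks just as fast.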
So if our hypotheses correspond to different function fits (one for each parameter configuration, meaning we’d have 2^(32k) hypotheses if our function fits used k 32-bit floating point numbers), the chance we put on any one of the function fits being correct will be tiny. So having more parameters is bad, because the way we picked our prior means our belief in any one hypothesis goes to zero as N goes to infinity.
So the Wikipedia derivation of the vanilla Bayesian posterior for model selection is telling us that having lots of parameters is bad, because it means we’re spreading our prior around exponentially many hypotheses… if we have the sort of prior that says all the hypotheses are about equally likely.
But that’s an insane prior to have! We only have 1.0 worth of probability to go around, and there’s an infinite number of different hypotheses. Which is why you’re supposed to assign prior based on K-complexity, or at least something that doesn’t go to zero as the number of hypotheses goes to infinity. The derivation is just showing us how things go bad if we don’t do that.
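(For concreteness, the standard fix being alluded to is a prior of the form

$$
p(\theta_i) \;\propto\; 2^{-K(\theta_i)}, \qquad \sum_i 2^{-K(\theta_i)} \;\le\; 1,
$$

where the sum stays bounded by the Kraft inequality, so the total prior mass doesn’t evaporate no matter how many hypotheses you enumerate.)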
In summary: badly normalised priors behave badly
SLT mostly just generalises this derivation to the case where parameter configurations in our function fits don’t line up one-to-one with hypotheses.
It tells us that if we are spreading our prior around evenly over lots of parameter configurations, but exponentially many of these parameter configurations are secretly just re-expressing the same hypothesis, then that hypothesis can actually get a decent amount of prior, even if the total number of parameter configurations is exponentially large.
So our prior over hypotheses in that case is actually somewhat well-behaved, in that it can end up normalised properly when we take N→∞. That is a basic requirement a sane prior needs to have, so we’re at least not completely shooting ourselves in the foot anymore. But that still doesn’t show why this prior, that neural networks sort of[1] implicitly have, is actually good. Just that it’s no longer obviously wrong in this specific way.
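In symbols, the point is that the prior a hypothesis h effectively receives is the parameter volume of its preimage under the parameter-to-function map θ ↦ f_θ, not a 1/N share:

$$
p(h) \;=\; \int p(\theta)\, \mathbf{1}\!\left[f_\theta = h\right] d\theta,
$$

which can stay bounded away from zero as the parameter count grows, provided the preimage of h is large (high-dimensional, or exponentially numerous) in parameter space.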
Why does this prior apparently make decent-ish predictions in practice? That is, why do neural networks generalise well?
I dunno. SLT doesn’t say. It just tells us how the conversion from a prior over parameters to a prior over hypotheses works, and in the process shows us that neural network priors can be at least somewhat sanely normalised for large numbers of parameters. More than we might have initially thought, at least.
That’s all though. It doesn’t tell us anything else about what makes a Gaussian over transformer parameter configurations a good starting guess for how the universe works.
How to make this story tighter?
If people aim to make further headway on the question of why some function fits generalise somewhat and others don’t, beyond ‘Well, standard Bayesianism suggests you should at least normalise your prior so that having more hypotheses isn’t actively bad’, then I’d suggest a starting point might be to make a different derivation for the posterior on the fits that isn’t trying to reason about p(M) defined as the probability that one of the function fits is ‘true’ in the sense of exactly predicting the data. Of course none of them are. We know that. When we fit a 150-billion-parameter transformer to internet data, we don’t expect going in that any of these 2^(16×150×10^9) parameter configurations will give zero loss up to quantum noise on any and all text prediction tasks in the universe until the end of time. Under that definition of M, which the SLT derivation of the posterior and most other derivations of this sort I’ve seen seem to implicitly make, we basically have p(M) ≈ 0 going in! Maybe look at the Bayesian posterior for a set of hypotheses we actually believe in at all before we even see any data, like M = ‘one of these models might get < 1.1 average loss on holdout data sets’.
SLT in three sentences
‘You thought your choice of prior was broken because it’s not normalised right, and so goes to zero if you hand it too many hypotheses. But you missed that the way you count your hypotheses is also broken, and the two mistakes sort of cancel out. Also, here’s a bunch of algebraic geometry that sort of helps you figure out what probabilities your weirdo prior actually assigns to hypotheses, though that part’s not really finished.’
SLT in one sentence
‘Loss basins with bigger volume will have more posterior probability if you start with a uniform-ish prior over parameters, because then bigger volumes get more prior, duh.’
Sorta, kind of, arguably. There’s some stuff left to work out here. For example, vanilla SLT doesn’t even actually tell you which parts of your posterior over parameters are part of the same hypothesis. It just sort of assumes that everything left with support in the posterior after training is part of the same hypothesis, even though some of these parameter settings might generalise totally differently outside the training data. My guess is that you can avoid having to check equivalence over all possible inputs by instead checking which parameter settings give the same hidden representations over the training data, not just the same outputs.
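A minimal sketch of the kind of check being suggested here (hypothetical code, my own construction: `params_a` and `params_b` are two weight settings for the same small ReLU MLP): compare the layer-by-layer activations on the training batch rather than just the final outputs.

```python
import numpy as np

def forward_with_activations(params, X):
    """Forward pass of a small ReLU MLP, returning every hidden representation.
    `params` is a list of (W, b) tuples; X has shape (batch, d_in)."""
    activations, h = [], X
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:          # ReLU on hidden layers only
            h = np.maximum(h, 0.0)
        activations.append(h)
    return activations

def same_internal_computation(params_a, params_b, X, tol=1e-6):
    """Crude equivalence check: do two parameter settings produce the same hidden
    representations (not just the same outputs) on the training data X?"""
    acts_a = forward_with_activations(params_a, X)
    acts_b = forward_with_activations(params_b, X)
    return all(np.allclose(a, b, atol=tol) for a, b in zip(acts_a, acts_b))
```

In practice you would at least want to quotient out neuron permutations and rescalings before comparing, but that is the basic idea.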
I would not say that the central insight of SLT is about priors. Under weak conditions the prior is almost irrelevant. Indeed, the RLCT is independent of the prior under very weak nonvanishing conditions.
EDIT: I have now changed my mind about this, not least because of Lucius’s influence. I currently think Bushnaq’s padding argument suggests that the essential content of SLT is that the uniform prior on codes is equivalent to the Solomonoff prior, via overparameterized and degenerate codes; SLT is a way to quantitatively study this phenomenon, especially for continuous models.
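A compressed version of the padding argument being referred to: fix a prefix-free language and put a uniform prior over programs of length exactly L. A program p with |p| ≤ L can be padded out to length L in about 2^(L−|p|) ways (the padding is never read), so the total weight its behaviour receives is

$$
\Pr[\text{behaviour of } p] \;\approx\; \frac{2^{\,L-|p|}}{2^{\,L}} \;=\; 2^{-|p|},
$$

which is the Solomonoff weighting; degenerate directions in parameter space play the role of the padding bits for continuous models.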
The story that symmetries mean that the parameter-to-function map is not injective is true but already well-understood outside of SLT.
It is a common misconception that this is what SLT amounts to.
To be sure—generic symmetries are seen by the RLCT. But these are, in some sense, the uninteresting ones. The interesting thing is the local singular structure and its unfolding in phase transitions during training.
The issue of the true distribution not being contained in the model is called ‘unrealizability’ in Bayesian statistics.
It is dealt with in Watanabe’s second ‘green’ book. Nonrealizability is key to the most important insight of SLT, contained in the last sections of the second-to-last chapter of the green book: algorithmic development during training through phase transitions in the free energy.
I don’t have the time to recap this story here.
Lucius-Alexander SLT dialogue?
I would not say that the central insight of SLT is about priors. Under weak conditions the prior is almost irrelevant. Indeed, the RLCT is independent of the prior under very weak nonvanishing conditions.
I don’t think these conditions are particularly weak at all. Any prior that fulfils them is a prior that would not be normalised right if the parameter-function map were one-to-one.
It’s a kind of prior people like to use a lot, but that doesn’t make it a sane choice.
A well-normalised prior for a regular model probably doesn’t look very continuous or differentiable in this setting, I’d guess.
To be sure—generic symmetries are seen by the RLCT. But these are, in some sense, the uninteresting ones. The interesting thing is the local singular structure and its unfolding in phase transitions during training.
The generic symmetries are not what I’m talking about. There are symmetries in neural networks that are neither generic, nor only present at finite sample size. These symmetries correspond to different parametrisations that implement the same input-output map. Different regions in parameter space can differ in how many of those equivalent parametrisations they have, depending on the internal structure of the networks at that point.
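A toy illustration of this kind of region-dependent redundancy (my own example, not from the comment): in a one-hidden-layer ReLU network, a unit whose outgoing weight is zero contributes nothing, so its incoming weights can be changed arbitrarily without affecting the input-output map, and parameter regions without such a unit don’t have that extra degeneracy.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_net(x, W1, b1, w2):
    """One-hidden-layer ReLU network mapping (batch, d) -> (batch,)."""
    return np.maximum(x @ W1 + b1, 0.0) @ w2

d, h = 3, 4
x = rng.normal(size=(200, d))
W1, b1, w2 = rng.normal(size=(d, h)), rng.normal(size=h), rng.normal(size=h)
w2[1] = 0.0                                  # unit 1 is disconnected from the output

W1_alt, b1_alt = W1.copy(), b1.copy()
W1_alt[:, 1] = rng.normal(size=d)            # arbitrary new incoming weights for unit 1
b1_alt[1] = rng.normal()

# Identical input-output behaviour despite genuinely different parameters:
print(np.allclose(relu_net(x, W1, b1, w2), relu_net(x, W1_alt, b1_alt, w2)))  # True
```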
The issue of the true distribution not being contained in the model is called ‘unrealizability’ in Bayesian statistics. It is dealt with in Watanabe’s second ‘green’ book. Nonrealizability is key to the most important insight of SLT contained in the last sections of the second to last chapter of the green book: algorithmic development during training through phase transitions in the free energy.
I know it ‘deals with’ unrealizability in this sense; that’s not what I meant.
I’m not talking about the problem of characterising the posterior right when the true model is unrealizable. I’m talking about the problem where the actual logical statement we defined our prior, and thus our free energy, relative to is an insane statement to make, and so the posterior you put on it ends up negligibly tiny compared to the probability mass that lies outside the model class.
But looking at the green book, I see it’s actually making very different, stat-mech style arguments that reason about the KL divergence between the true distribution and the guess made by averaging the predictions of all models in the parameter space according to their support in the posterior. I’m going to have to translate more of this into Bayes to know what I think of it.
The RLCT λ gives the first-order term for in-distribution generalization error and also for Bayesian learning (technically, for the ‘Bayesian free energy’). This justifies the name ‘learning coefficient’ for λ. I emphasize that these are mathematically precise statements with complete proofs, not conjectures or intuitions.
Link(s) to your favorite proof(s)?
Also, do these match up with empirical results?
Knowing a little SLT will inoculate you against many wrong theories of deep learning that abound in the literature. I won’t be going into it here, but suffice to say that any paper assuming that the Fisher information metric is regular for deep neural networks, or for any kind of hierarchical structure, is fundamentally flawed. And you can be sure this assumption is sneaked in all over the place. For instance, this is almost always the case when people talk about the Laplace approximation.
I have a cached belief that the Laplace approximation is also disproven by ensemble studies, so I don’t really need SLT to inoculate me against that. I’d mainly be interested if SLT shows something beyond that.
As I read the empirical formulas in this paper, they’re roughly saying that a network has a high empirical learning coefficient if an ensemble of models that are slightly less trained has, on average, a worse loss than the network.
But so that they don’t have to retrain the models from scratch, they basically take a trained model and wiggle it around using Gaussian noise while retraining it.
This seems like a reasonable way to estimate how locally flat the loss landscape is. I guess there’s a question of how much the devil is in the details; like whether you need SLT to derive an exact formula that works.
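For what it’s worth, here is a schematic of the kind of estimator used in that line of work, as I understand it (a sketch under my own assumptions, not the referenced paper’s actual recipe): sample around the trained point w* with Langevin noise and a localizing pull toward w*, and read off λ̂ = nβ(E[L] − L(w*)) with β = 1/log n.

```python
import numpy as np

rng = np.random.default_rng(0)

def lambda_hat(loss, grad, w_star, n, n_steps=20_000, lr=1e-5, gamma=100.0):
    """Schematic local learning-coefficient estimate via Langevin sampling:
    lambda_hat = n * beta * (E[loss(w)] - loss(w_star)), beta = 1 / log(n),
    with samples drawn near exp(-n*beta*loss(w) - gamma/2 * ||w - w_star||^2)."""
    beta = 1.0 / np.log(n)
    w, losses = w_star.copy(), []
    for _ in range(n_steps):
        g = n * beta * grad(w) + gamma * (w - w_star)   # gradient of the negative log-density
        w = w - 0.5 * lr * g + np.sqrt(lr) * rng.normal(size=w.shape)
        losses.append(loss(w))
    return n * beta * (np.mean(losses) - loss(w_star))

# Regular 2D quadratic minimum: the estimate comes out close to d/2 = 1
# (shrunk slightly by the localization term).
loss_reg, grad_reg = lambda w: 0.5 * np.sum(w ** 2), lambda w: w
# Singular minimum with flat valleys along both axes: the estimate comes out
# much smaller, reflecting the degenerate directions.
loss_sing = lambda w: (w[0] * w[1]) ** 2
grad_sing = lambda w: np.array([2 * w[0] * w[1] ** 2, 2 * w[1] * w[0] ** 2])

w_star = np.zeros(2)
print(lambda_hat(loss_reg, grad_reg, w_star, n=10_000))
print(lambda_hat(loss_sing, grad_sing, w_star, n=10_000))
```

The hyperparameters (step size, localization strength, sample size) matter a lot in practice; this is only meant to show the shape of the estimator.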
I guess I’m still not super sold on it, but on reflection that’s probably partly because I don’t have any immediate need for computing basin broadness. Like I find the basin broadness theory nice to have as a model, but now that I know about it, I’m not sure why I’d want/need to study it further.
There was a period where I spent a lot of time thinking about basin broadness. I guess I eventually abandoned it because I realized the basin was built out of a bunch of sigmoid functions layered on top of each other, but the generalization was really driven by the neural tangent kernel, which in turn is mostly driven by the Jacobian of the network outputs for the dataset as a function of the weights, which in turn is mostly driven by the network activations. I guess it’s plausible that SLT has the best quantities if you stay within the basin broadness paradigm. 🤔
It’s one of the most computationally applicable ones we have? Yes. SLT quantities like the RLCT can be analytically computed for many statistical models of interest, correctly predict phase transitions in toy neural networks, and can be estimated at scale.
EDIT: no hype about future work. Wait and see! :)
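One family of cases where the computation is fully explicit (standard in the SLT literature): if the loss/KL term can be brought into normal-crossing form K(w) = w_1^{2k_1}···w_d^{2k_d} near the optimum (with a non-vanishing prior), then

$$
\lambda \;=\; \min_{j} \frac{1}{2k_j}, \qquad m \;=\; \#\Big\{\, j : \tfrac{1}{2k_j} = \lambda \,\Big\},
$$

so for example K(w) = w_1^2 w_2^2 gives λ = 1/2 with m = 2, versus λ = 1 for a regular two-parameter quadratic.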
All proofs are contained in Watanabe’s standard text, see here: https://www.cambridge.org/core/books/algebraic-geometry-and-statistical-learning-theory/9C8FD1BDC817E2FC79117C7F41544A3A