I would not say that the central insight of SLT is about priors. Under weak conditions the prior is almost irrelevant. Indeed, the RLCT is independent of the prior under very weak nonvanishing conditions.
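To unpack what ‘independent of the prior’ means here, a toy numerical check (my own example, not taken from Watanabe): for the potential $K(w) = w_1^2 w_2^2$ on $[-1,1]^2$ the RLCT is $\lambda = 1/2$, visible as the exponent in the volume scaling $V(\varepsilon) \sim C\,\varepsilon^{1/2}\log(1/\varepsilon)$, and reweighting that volume by any prior bounded above and below by positive constants leaves the exponent unchanged.

```python
import numpy as np

# Toy potential K(w) = (w1 * w2)^2 on [-1, 1]^2. Its RLCT is 1/2 with multiplicity 2,
# so the volume of the sublevel set scales as V(eps) ~ C * eps^(1/2) * log(1/eps).
# We estimate the prior-weighted volume of {K < eps} for a uniform prior and for a
# smooth prior bounded above and below by positive constants, and compare the exponents.

rng = np.random.default_rng(0)
n = 2_000_000
w = rng.uniform(-1.0, 1.0, size=(n, 2))
K = (w[:, 0] * w[:, 1]) ** 2

# A smooth, strictly positive ("nonvanishing") prior density on the square, unnormalised.
bumpy = 1.0 + 0.9 * np.cos(3.0 * w[:, 0]) * np.cos(2.0 * w[:, 1])

eps = np.array([1e-2, 1e-3, 1e-4, 1e-5])
v_uniform = np.array([(K < e).mean() for e in eps])
v_bumpy = np.array([bumpy[K < e].sum() / bumpy.sum() for e in eps])

# Local scaling exponents d log V / d log eps. The two rows agree with each other and
# drift up toward the true exponent 1/2 as eps shrinks (the gap is the log(1/eps) factor).
print(np.diff(np.log(v_uniform)) / np.diff(np.log(eps)))
print(np.diff(np.log(v_bumpy)) / np.diff(np.log(eps)))
```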
EDIT: I have now changed my mind about the claim that the prior is almost irrelevant, not least because of Lucius’s influence. I currently think Bushnaq’s padding argument suggests that the essence of SLT is that the uniform prior on codes is equivalent to the Solomonoff prior, via overparameterized and degenerate codes; SLT is a way to study this phenomenon quantitatively, especially for continuous models.
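To spell out the padding argument as I understand it: fix a code space of strings of length $L$ (binary, for concreteness). If a behaviour has a minimal implementation of length $\ell \le L$ and the remaining $L - \ell$ symbols are free padding, the uniform prior over length-$L$ codes assigns it

$$
\Pr(\text{behaviour}) \;=\; \frac{2^{\,L-\ell}}{2^{\,L}} \;=\; 2^{-\ell},
$$

which is exactly the Solomonoff-style weighting by description length. For continuous, overparameterized models, degenerate directions in parameter space play the role of the padding.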
The story that symmetries mean that the parameter-to-function map is not injective is true but already well-understood outside of SLT.
It is a common misconception that this is what SLT amounts to.
To be sure—generic symmetries are seen by the RLCT. But these are, in some sense, the uninteresting ones. The interesting thing is the local singular structure and its unfolding in phase transitions during training.
The issue of the true distribution not being contained in the model is called ‘unrealizability’ in Bayesian statistics.
It is dealt with in Watanabe’s second ‘green’ book. Unrealizability is key to the most important insight of SLT, contained in the last sections of the second-to-last chapter of the green book: algorithmic development during training through phase transitions in the free energy.
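Roughly, and glossing over the multiplicity ($\log\log n$) corrections, the quantitative shape of that story is: for a region (phase) $\mathcal{W}_i$ of parameter space,

$$
F_n(\mathcal{W}_i) \;\approx\; n S_n \;+\; n K_i \;+\; \lambda_i \log n,
$$

where $S_n$ is an empirical entropy term common to all phases, $K_i$ is the smallest KL divergence to the truth attainable inside $\mathcal{W}_i$, and $\lambda_i$ is its RLCT. The posterior concentrates on the phase with the lowest free energy, so a less accurate but less complex phase (large $K_1$, small $\lambda_1$) loses to a more accurate but more complex one (small $K_2$, large $\lambda_2$) once

$$
n\,(K_1 - K_2) \;>\; (\lambda_2 - \lambda_1)\,\log n .
$$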
> I would not say that the central insight of SLT is about priors. Under weak conditions the prior is almost irrelevant. Indeed, the RLCT is independent of the prior under very weak nonvanishing conditions.
I don’t think these conditions are particularly weak at all. Any prior that fulfils them is a prior that would not be normalised right if the parameter-function map were one-to-one.
It’s a kind of prior people like to use a lot, but that doesn’t make it a sane choice.
A well-normalised prior for a regular model probably doesn’t look very continuous or differentiable in this setting, I’d guess.
> To be sure—generic symmetries are seen by the RLCT. But these are, in some sense, the uninteresting ones. The interesting thing is the local singular structure and its unfolding in phase transitions during training.
The generic symmetries are not what I’m talking about. There are symmetries in neural networks that are neither generic, nor only present at finite sample size. These symmetries correspond to different parametrisations that implement the same input-output map. Different regions in parameter space can differ in how many of those equivalent parametrisations they have, depending on the internal structure of the networks at that point.
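A toy way to see this region dependence (my own example): for a two-unit tanh network $f(x) = a_1\tanh(b_1 x) + a_2\tanh(b_2 x)$, the rank of the Jacobian of the parameter-to-function map drops exactly where a unit is switched off, because a whole line of parameter settings then implements the same input-output map.

```python
import numpy as np

# Tiny tanh network f(x) = a1*tanh(b1*x) + a2*tanh(b2*x), parameters theta = (a1, b1, a2, b2).
# The rank of the Jacobian of the parameter-to-function map (evaluated on a grid of inputs)
# counts how many directions in parameter space actually change the implemented function.

def jacobian(theta, xs):
    a1, b1, a2, b2 = theta
    # Analytic partial derivatives of f(x) w.r.t. each parameter, one row per input x.
    cols = [
        np.tanh(b1 * xs),                  # d f / d a1
        a1 * xs / np.cosh(b1 * xs) ** 2,   # d f / d b1
        np.tanh(b2 * xs),                  # d f / d a2
        a2 * xs / np.cosh(b2 * xs) ** 2,   # d f / d b2
    ]
    return np.stack(cols, axis=1)

xs = np.linspace(-2.0, 2.0, 50)

generic = np.array([1.0, 0.7, -0.5, 1.3])    # both units active and distinct
degenerate = np.array([1.0, 0.7, 0.0, 1.3])  # second unit's output weight is zero

for name, theta in [("generic", generic), ("degenerate", degenerate)]:
    rank = np.linalg.matrix_rank(jacobian(theta, xs), tol=1e-8)
    print(name, "Jacobian rank:", rank)

# Prints rank 4 at the generic point but rank 3 at the degenerate one: with a2 = 0 the
# parameter b2 can be varied freely without changing the input-output map, so this region
# of parameter space carries an extra continuum of equivalent parametrisations.
```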
> The issue of the true distribution not being contained in the model is called ‘unrealizability’ in Bayesian statistics. It is dealt with in Watanabe’s second ‘green’ book. Unrealizability is key to the most important insight of SLT, contained in the last sections of the second-to-last chapter of the green book: algorithmic development during training through phase transitions in the free energy.
I know it ‘deals with’ unrealizability in this sense; that’s not what I meant.
I’m not talking about the problem of characterising the posterior correctly when the true model is unrealizable. I’m talking about the problem where the logical statement we defined our prior, and thus our free energy, relative to is an insane statement to make, so the posterior mass you put on it ends up negligibly tiny compared to the probability mass that lies outside the model class.
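One heuristic way to make that quantitative: if the model class $\mathcal{M}$ cannot represent the true distribution $q$, then against any alternative hypothesis class $\mathcal{M}'$ that can, its marginal likelihood decays exponentially,

$$
\frac{p(D_n \mid \mathcal{M})}{p(D_n \mid \mathcal{M}')} \;\approx\; \exp\!\Big(-\,n \min_{w} \mathrm{KL}\big(q \,\Vert\, p(\cdot \mid w)\big) + O(\log n)\Big),
$$

so under any broader prior that also puts weight on hypotheses outside the class, the posterior mass left on $\mathcal{M}$ becomes negligible.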
But looking at the green book, I see it’s actually making very different, stat-mech-style arguments that reason about the KL divergence between the true distribution and the guess made by averaging the predictions of all models in the parameter space according to their weight in the posterior. I’m going to have to translate more of this into Bayes to know what I think of it.
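For reference, the objects in question, as I read it, are the Bayes predictive distribution and its generalization error:

$$
p^*(x) \;=\; \int p(x \mid w)\, p(w \mid D_n)\, dw,
\qquad
G_n \;=\; \int q(x)\,\log\frac{q(x)}{p^*(x)}\,dx,
$$

i.e. the prediction obtained by averaging every model in the parameter space weighted by its posterior mass, compared in KL divergence to the true distribution $q$.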
I don’t have the time to recap this story here.
Lucius-Alexander SLT dialogue?