The RLCT λ gives the first-order term for in-distribution generalization error and also for Bayesian learning (technically, for the 'Bayesian free energy'). This justifies the name 'learning coefficient' for λ. I emphasize that these are mathematically precise statements with complete proofs, not conjectures or intuitions.
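Concretely, the two statements I have in mind are, roughly, in Watanabe's notation (n samples, empirical negative log likelihood L_n, optimal parameter w_0, RLCT λ with multiplicity m):

$$
F_n = n L_n(w_0) + \lambda \log n - (m-1)\log\log n + O_p(1),
\qquad
\mathbb{E}[G_n] = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right),
$$

where F_n is the Bayesian free energy (negative log marginal likelihood) and G_n is the Bayes generalization error; both expansions are asymptotic in n.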
Link(s) to your favorite proof(s)?
Also, do these match up with empirical results?
Knowing a little SLT will inoculate you against many wrong theories of deep learning that abound in the literature. I won't be going into it here, but suffice it to say that any paper assuming the Fisher information metric is regular for deep neural networks, or for any kind of hierarchical structure, is fundamentally flawed. And you can be sure this assumption is sneaked in all over the place. For instance, this is almost always the case when people talk about the Laplace approximation.
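To make the degeneracy concrete, here is a minimal toy sketch (my own illustration, not from any paper): a one-hidden-unit model whose Fisher information matrix is exactly singular at some parameters, which is precisely the situation a Laplace approximation silently assumes away.

```python
import numpy as np

# Toy model f(x; w) = w1 * tanh(w2 * x) with unit-variance Gaussian noise.
# At w1 = 0 the parameter w2 has no effect on the output, so the Fisher
# information matrix is singular there.

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

def fisher_information(w1, w2):
    # df/dw = (tanh(w2 * x), w1 * x * (1 - tanh(w2 * x)^2))
    t = np.tanh(w2 * x)
    grads = np.stack([t, w1 * x * (1 - t**2)], axis=1)   # shape (N, 2)
    # For Gaussian noise with sigma = 1, the Fisher information is
    # E[(df/dw)(df/dw)^T], estimated here by the sample mean.
    return grads.T @ grads / len(x)

for w1, w2 in [(1.0, 0.5), (0.0, 0.5)]:
    eigvals = np.linalg.eigvalsh(fisher_information(w1, w2))
    print(f"w = ({w1}, {w2}), Fisher eigenvalues = {eigvals}")
# At (0.0, 0.5) one eigenvalue is exactly 0: the metric is degenerate there,
# so a Gaussian/Laplace approximation around such points is ill-defined.
```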
I have a cached belief that the Laplace approximation is also disproven by ensemble studies, so I don’t really need SLT to inoculate me against that. I’d mainly be interested if SLT shows something beyond that.
As I read the empirical formulas in this paper, they're roughly saying that a network has a high empirical learning coefficient if an ensemble of models that are, on average, slightly less trained has a worse loss than the network.
But then, so they don't have to retrain the models from scratch, they basically take a trained model and wiggle it around with Gaussian noise while retraining it.
This seems like a reasonable way to estimate how locally flat the loss landscape is. I guess there’s a question of how much the devil is in the details; like whether you need SLT to derive an exact formula that works.
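For what it's worth, here is how I picture that recipe as code. This is my own hypothetical sketch of a localized SGLD estimator (the function names, toy loss, and hyperparameters are mine, not the paper's): sample around the trained weights at inverse temperature β = 1/log n, with a Gaussian pull back toward the trained point, and read off λ̂ = nβ · (mean sampled loss − loss at the trained weights).

```python
import numpy as np

def estimate_llc(w_star, loss_fn, grad_fn, n, steps=2000, eps=1e-4,
                 gamma=1.0, seed=0):
    """Hypothetical sketch: local learning coefficient via SGLD around w_star.

    lambda_hat = n * beta * (mean loss along the chain - loss(w_star)),
    with beta = 1 / log(n). The gamma term is the Gaussian "leash" keeping the
    chain near w_star; too large a gamma biases the estimate downward.
    """
    rng = np.random.default_rng(seed)
    beta = 1.0 / np.log(n)
    w = w_star.copy()
    losses = []
    for _ in range(steps):
        # Langevin step on the potential n*beta*loss(w) + gamma/2*||w - w_star||^2
        drift = -0.5 * eps * (n * beta * grad_fn(w) + gamma * (w - w_star))
        w = w + drift + np.sqrt(eps) * rng.normal(size=w.shape)
        losses.append(loss_fn(w))
    return n * beta * (np.mean(losses) - loss_fn(w_star))

# Sanity check on a regular (non-singular) toy loss in d = 2 dimensions,
# where the learning coefficient should be d/2 = 1.
d, n = 2, 10_000
w_star = np.zeros(d)
print(estimate_llc(w_star, lambda w: 0.5 * np.sum(w**2), lambda w: w, n))
# prints roughly 1.0
```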
I guess I’m still not super sold on it, but on reflection that’s probably partly because I don’t have any immediate need for computing basin broadness. Like I find the basin broadness theory nice to have as a model, but now that I know about it, I’m not sure why I’d want/need to study it further.
There was a period where I spent a lot of time thinking about basin broadness. I guess I eventually abandoned it because I realized the basin was built out of a bunch of sigmoid functions layered on top of each other, but the generalization was really driven by the neural tangent kernel, which in turn is mostly driven by the Jacobian of the network outputs for the dataset as a function of the weights, which in turn is mostly driven by the network activations. I guess it’s plausible that SLT has the best quantities if you stay within the basin broadness paradigm. 🤔
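For concreteness, here's a toy version of that chain "NTK ← Jacobian of the outputs with respect to the weights" (the tiny network and finite-difference Jacobian are my own illustration): the empirical NTK on a dataset is just J Jᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(w, x, hidden=8):
    # One-hidden-layer tanh network with scalar output; w is a flat vector:
    # w[:8] = W1, w[8:16] = b1, w[16:24] = W2.
    W1 = w[:hidden].reshape(hidden, 1)
    b1 = w[hidden:2 * hidden]
    W2 = w[2 * hidden:3 * hidden].reshape(1, hidden)
    return (W2 @ np.tanh(W1 @ x[None, :] + b1[:, None])).ravel()

def jacobian(f, w, x, eps=1e-5):
    # Finite-difference Jacobian: one row per data point, one column per weight.
    base = f(w, x)
    cols = []
    for i in range(len(w)):
        dw = np.zeros_like(w)
        dw[i] = eps
        cols.append((f(w + dw, x) - base) / eps)
    return np.stack(cols, axis=1)             # shape (n_data, n_params)

x = np.linspace(-2, 2, 5)
w = rng.normal(scale=0.5, size=3 * 8)
J = jacobian(mlp, w, x)
# The columns of J for the output-layer weights W2 are exactly the hidden
# activations, which is one sense in which the kernel is "driven by activations".
ntk = J @ J.T                                  # empirical NTK, shape (5, 5)
print(np.round(ntk, 3))
```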
All proofs are contained in Watanabe's standard text; see here:
https://www.cambridge.org/core/books/algebraic-geometry-and-statistical-learning-theory/9C8FD1BDC817E2FC79117C7F41544A3A