I just remembered the main way in which NNs are frequentist. They belong to a very illustrious family of frequentist estimators: the maximum likelihood estimators.
Think about it: NNs have a bunch of parameters. Their loss is basically always logp(y|x,θ) (e.g. mean-squared error for Gaussian p, cross-entropy for categorical p). They get trained by minimizing the loss (i.e. maximizing the likelihood).
In classical frequentist analysis they’re likely to be a terrible, overfitted estimator, because they have many parameters. And I think this is true if you find the actually maximizing parameters θ∗=argmaxθlogp(y|x,θ).
But SGD is kind of a shitty optimizer. It turns out the two mistakes cancel out, and NNs are very effective.
I don’t think I understand your model of why neural networks are so effective. It sounds like you say that on the one hand neural networks have lots of parameters, so you should expect them to be terrible, but they are actually very good because SGD is a such a shitty optimizer on the other hand that it acts as an implicit regularizer.
Coming from the perspective of singular learning theory, neural networks work because SGD weights solutions by their parameter volume, which is dominated by low-complexity singularities, and is close enough to a bayesian posterior that it ends up being able to be modeled well from that frame.
This theory is very bayes-law inspired, though I don’t tout neural networks as evidence in favor of bayesianism, since the question seems not very related, and maybe the pioneers of the field had some deep frequentist motivated intuitions about neural networks. My impression though is they were mostly just motivated by looking at the brain at first, then later on by following trend-lines. And in fact paid little attention to theoretical or philosophical concerns (though not zero, people talked much about connectionism. I would guess this correlated with being a frequentist, though I would guess the correlation was very modest, and maybe success correlated more with just not caring all that much).
There may be a synthesis position here where you claim that SGD weighing solutions by their size in the weight space is in fact what you mean by SGD being a implicit regularizer. In such a case, I claim this is just sneaking in bayes rule without calling it by name, and this is not a very smart thing to do, because the bayesian frame gives you a bunch more leverage on analyzing the system[1]. I actually think I remember a theorem showing that all MLE + regularizer learners are doing some kind of bayesian learning, though I could be mistaken and I don’t believe this is a crux for me here.
If our models end up different, I think there’s a bunch of things which you end up being utterly confused by in deep learning, which I’m not[2].
In the sense that though I’m confused about lots of the technical details, I would know exactly which books or math or people I should consult to no longer be confused.
In such a case, I claim this is just sneaking in bayes rule without calling it by name, and this is not a very smart thing to do, because the bayesian frame gives you a bunch more leverage on analyzing the system
I disagree. An inductive bias is not necessarily a prior distribution. What’s the prior?
The prior assigns uniform probability to all weights, and I believe a good understanding of the mapping from weights to functions is unknown, though lots of the time there are many directions you can move in in the weight space which don’t change your function, so one would expect its a relatively compressive mapping (in contrast to, say, a polynomial parameterization, where the mapping is one-to-one).
Also, side-comment: Thanks for the discussion! Its fun.
EDIT: Actually, there should be a term for the stochasticity which you integrate into the SLT equations like you would temperature in a physical system. I don’t remember exactly how this works though. Or if its even known the exact connection with SGD.
I don’t think I understand your model of why neural networks are so effective. It sounds like you say that on the one hand neural networks have lots of parameters, so you should expect them to be terrible, but they are actually very good because SGD is a such a shitty optimizer on the other hand that it acts as an implicit regularizer.
Yeah, that’s basically my model. How it regularizes I don’t know. Perhaps the volume of “simple” functions is the main driver of this, rather than gradient descent dynamics. I think the randomness of it is important; full-gradient descent (no stochasticity) would not work nearly as well.
Oh this reminded me of the temperature component of SLT, which I believe modulates how sharply one should sample from the bayesian posterior, or perhaps how heavily to update on new evidence. I forget. In any case, it does this to try to capture the stochasticity component of SGD. Its still an open problem to show how successfully though, I believe.
This one is interesting. It argues that the regularization properties are not in SGD, but rather in the NN parameterization, and that non-gradient optimizers also find simple solutions which generalize well. They talk about Bayes only in a paragraph in page 3. They say that literature that argues that NNs work well because they’re Bayesian is related (which is true—it’s also about generalization and volumes). But I see little evidence that the explanation in this paper is an appeal to Bayesian thinking. A simple question for you: what prior distribution do the NNs have, according to the findings in this paper?
This paper finds that the probability that SGD finds a function is correlated with the posterior probability of a Gaussian process conditioned on the same data. Except if you use the Gaussian process they’re using to do predictions, it does not work as well as the NN. So you can’t explain that the NN works well by appealing that it’s similar to this particular Bayesian posterior.
I have many problems with SLT and a proper comment will take me a couple extra hours. But also I could come away thinking that it’s basically correct, so maybe this is the one.
This paper finds that the probability that SGD finds a function is correlated with the posterior probability of a Gaussian process conditioned on the same data. Except if you use the Gaussian process they’re using to do predictions, it does not work as well as the NN. So you can’t explain that the NN works well by appealing that it’s similar to this particular Bayesian posterior.
Yup this changes my mind about the relevance of this paper.
This one is interesting. It argues that the regularization properties are not in SGD, but rather in the NN parameterization, and that non-gradient optimizers also find simple solutions which generalize well. They talk about Bayes only in a paragraph in page 3. They say that literature that argues that NNs work well because they’re Bayesian is related (which is true—it’s also about generalization and volumes). But I see little evidence that the explanation in this paper is an appeal to Bayesian thinking. A simple question for you: what prior distribution do the NNs have, according to the findings in this paper?
In brief: In weight space, uniform. In function space, its an open problem and the paper says relatively little about that. Only showing that conditioning on a function with zero loss, and weighing by its corresponding size in the weight space gets you the same result as training a neural network. The former process is sampling from a bayesian posterior.
Less brief: The prior assigns uniform probability to all weights, and I believe a good understanding of the mapping from weights to functions is unknown, though lots of the time there are many directions you can move in in the weight space which don’t change your function, so one would expect its a relatively compressive mapping (in contrast to, say, a polynomial parameterization, where the mapping is one-to-one).
will say more about your other comment later (maybe).
EDIT: Actually, there should be a term for the stochasticity which you integrate into the SLT equations like you would temperature in a physical system. I don’t remember exactly how this works though. Or if its even known the exact connection with SGD.
I just remembered the main way in which NNs are frequentist. They belong to a very illustrious family of frequentist estimators: the maximum likelihood estimators.
Think about it: NNs have a bunch of parameters. Their loss is basically always logp(y|x,θ) (e.g. mean-squared error for Gaussian p, cross-entropy for categorical p). They get trained by minimizing the loss (i.e. maximizing the likelihood).
In classical frequentist analysis they’re likely to be a terrible, overfitted estimator, because they have many parameters. And I think this is true if you find the actually maximizing parameters θ∗=argmaxθlogp(y|x,θ).
But SGD is kind of a shitty optimizer. It turns out the two mistakes cancel out, and NNs are very effective.
I don’t think I understand your model of why neural networks are so effective. It sounds like you say that on the one hand neural networks have lots of parameters, so you should expect them to be terrible, but they are actually very good because SGD is a such a shitty optimizer on the other hand that it acts as an implicit regularizer.
Coming from the perspective of singular learning theory, neural networks work because SGD weights solutions by their parameter volume, which is dominated by low-complexity singularities, and is close enough to a bayesian posterior that it ends up being able to be modeled well from that frame.
This theory is very bayes-law inspired, though I don’t tout neural networks as evidence in favor of bayesianism, since the question seems not very related, and maybe the pioneers of the field had some deep frequentist motivated intuitions about neural networks. My impression though is they were mostly just motivated by looking at the brain at first, then later on by following trend-lines. And in fact paid little attention to theoretical or philosophical concerns (though not zero, people talked much about connectionism. I would guess this correlated with being a frequentist, though I would guess the correlation was very modest, and maybe success correlated more with just not caring all that much).
There may be a synthesis position here where you claim that SGD weighing solutions by their size in the weight space is in fact what you mean by SGD being a implicit regularizer. In such a case, I claim this is just sneaking in bayes rule without calling it by name, and this is not a very smart thing to do, because the bayesian frame gives you a bunch more leverage on analyzing the system[1]. I actually think I remember a theorem showing that all MLE + regularizer learners are doing some kind of bayesian learning, though I could be mistaken and I don’t believe this is a crux for me here.
If our models end up different, I think there’s a bunch of things which you end up being utterly confused by in deep learning, which I’m not[2].
At the same time repeating that to me this doesn’t seem that relevant to the true question.
In the sense that though I’m confused about lots of the technical details, I would know exactly which books or math or people I should consult to no longer be confused.
I disagree. An inductive bias is not necessarily a prior distribution. What’s the prior?
From another comment of mine:
Also, side-comment: Thanks for the discussion! Its fun.
EDIT: Actually, there should be a term for the stochasticity which you integrate into the SLT equations like you would temperature in a physical system. I don’t remember exactly how this works though. Or if its even known the exact connection with SGD.
Yeah, that’s basically my model. How it regularizes I don’t know. Perhaps the volume of “simple” functions is the main driver of this, rather than gradient descent dynamics. I think the randomness of it is important; full-gradient descent (no stochasticity) would not work nearly as well.
Oh this reminded me of the temperature component of SLT, which I believe modulates how sharply one should sample from the bayesian posterior, or perhaps how heavily to update on new evidence. I forget. In any case, it does this to try to capture the stochasticity component of SGD. Its still an open problem to show how successfully though, I believe.
OK, let’s look through the papers you linked.
This one is interesting. It argues that the regularization properties are not in SGD, but rather in the NN parameterization, and that non-gradient optimizers also find simple solutions which generalize well. They talk about Bayes only in a paragraph in page 3. They say that literature that argues that NNs work well because they’re Bayesian is related (which is true—it’s also about generalization and volumes). But I see little evidence that the explanation in this paper is an appeal to Bayesian thinking. A simple question for you: what prior distribution do the NNs have, according to the findings in this paper?
This paper finds that the probability that SGD finds a function is correlated with the posterior probability of a Gaussian process conditioned on the same data. Except if you use the Gaussian process they’re using to do predictions, it does not work as well as the NN. So you can’t explain that the NN works well by appealing that it’s similar to this particular Bayesian posterior.
I have many problems with SLT and a proper comment will take me a couple extra hours. But also I could come away thinking that it’s basically correct, so maybe this is the one.
Yup this changes my mind about the relevance of this paper.
In brief: In weight space, uniform. In function space, its an open problem and the paper says relatively little about that. Only showing that conditioning on a function with zero loss, and weighing by its corresponding size in the weight space gets you the same result as training a neural network. The former process is sampling from a bayesian posterior.
Less brief: The prior assigns uniform probability to all weights, and I believe a good understanding of the mapping from weights to functions is unknown, though lots of the time there are many directions you can move in in the weight space which don’t change your function, so one would expect its a relatively compressive mapping (in contrast to, say, a polynomial parameterization, where the mapping is one-to-one).
will say more about your other comment later (maybe).
EDIT: Actually, there should be a term for the stochasticity which you integrate into the SLT equations like you would temperature in a physical system. I don’t remember exactly how this works though. Or if its even known the exact connection with SGD.