Nice survey. The result about double descent even occurring in dataset size is especially surprising.
Regarding the ‘sharp minima can generalize’ paper, they show that there exist sharp minima with good generalization, not flat minima with poor generalization, so they don’t rule out flatness as an explanation for the success of SGD. The sharp minima they construct with this property are also rather unnatural: essentially they multiply the weights of layer 1 by a constant α and divide the weights of layer 2 by the same constant. The piecewise linearity of ReLU means the output function is unchanged. For large α, the network is now highly sensitive to perturbations in layer 2. These solutions don’t seem like they would be found by SGD, so it could still be that, for solutions found by SGD, flatness and generalization are correlated.
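(For concreteness, a minimal numpy sketch of that rescaling trick; the network shape, the weights, and the value of α below are made up for illustration and are not taken from the paper. The first layer's bias is scaled along with its weights so the equality is exact.)

```python
import numpy as np

# A purely illustrative 2-layer ReLU net; weights and alpha are made up,
# this just demonstrates the rescaling construction described above.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 10))                      # small batch of inputs
W1, b1 = rng.normal(size=(10, 32)), rng.normal(size=32)
W2, b2 = rng.normal(size=(32, 1)), rng.normal(size=1)

def f(x, W1, b1, W2, b2):
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2  # ReLU hidden layer, linear output

alpha = 1e3
# scale layer 1 (weights and bias) up by alpha, layer 2 down by alpha
W1a, b1a, W2a = alpha * W1, alpha * b1, W2 / alpha

# the computed function is unchanged, because relu(alpha * z) = alpha * relu(z)
print(np.allclose(f(x, W1, b1, W2, b2), f(x, W1a, b1a, W2a, b2)))  # True

# but the same small perturbation of layer 2 now moves the output ~alpha times more
eps = 1e-3 * rng.normal(size=W2.shape)
d_orig = np.abs(f(x, W1, b1, W2 + eps, b2) - f(x, W1, b1, W2, b2)).max()
d_resc = np.abs(f(x, W1a, b1a, W2a + eps, b2) - f(x, W1a, b1a, W2a, b2)).max()
print(d_resc / d_orig)                             # roughly alpha
```

The printed ratio comes out at roughly α: the rescaled network computes exactly the same function, but is about α times more sensitive to the same perturbation of layer 2, which is the sense in which it sits at a "sharper" minimum of the loss.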
Ah, thanks for the summary. I hadn't fully read that paper yet (I knew it existed, so I figured I'd link it), but that makes sense. Seems like in that case the flat vs. sharp minima hypothesis still has a lot going for it; not sure how that interacts with the lottery ticket hypothesis, though.
I’m not sure I agree with interstice’s reading of the ‘sharp minima’ paper. As I understand it, they show that a given function can be made into a sharp or flat minimum by finding a suitable point in parameter space that maps to that function. So if one has a sharp minimum that does not generalise (which I think we will agree exists), then one can make the same function into a flat minimum, which will still not generalise, as it is the same function!
Sorry I’m 2 years late to the party...
After doing a quick re-skim I still think my reading is correct. Where in the paper do you think they say that, given a sharp minimum, you can produce a flatter one? The theorems in section 4 are all of the form “given a network, we can find an observationally equivalent one with an arbitrarily large spectral norm”, a.k.a. given any network, it’s always possible to find a sharper equivalent network.
Thank you for the quick reply! I’m thinking about section 5.1 on reparametrising the model, where they write:
every minimum is observationally equivalent to an infinitely sharp minimum and to an infinitely flat minimum when considering nonzero eigenvalues of the Hessian;
If we stick to section 4 (and so don’t allow reparametrisation) I agree there seems to be something more tricky going on. I initially assumed that I could e.g. modify the proof of Theorem 4 to make a sharp minimum flat by taking α to be big, but it doesn’t work like that (basically we’re looking at α + 1/α, which can easily be made big, but not very small). So maybe you are right that we can only make flat minima sharp and not conversely. I’d like to understand this better!
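Spelling out that parenthetical, in case it helps: for α > 0 we have, by AM-GM,

```latex
\alpha + \frac{1}{\alpha} \;\geq\; 2\sqrt{\alpha \cdot \tfrac{1}{\alpha}} \;=\; 2
```

with equality only at α = 1, so the quantity can be pushed arbitrarily high but never below its starting value of 2. That seems to be why this particular construction only moves things in the ‘sharper’ direction.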
Yes, they do say they can make the minima flatter in section 5.1 -- but, as you say, only by re-parameterizing. This makes the result totally vacuous! Like yeah, obviously if you’re allowed to arbitrarily re-define the mapping from parameters to functions you can make the minima as “sharp” or as “flat” as you want. For the more substantive results in section 4, I do believe the direction is always flat --> sharp.
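To make the “obviously” concrete, a minimal sketch, assuming a smooth invertible reparameterization w = g(θ) of the weights: the reparameterized loss is L(g(θ)), and at a minimum w* = g(θ*), where the gradient of L vanishes, the chain rule gives

```latex
\nabla^2_{\theta}\, L(g(\theta)) \Big|_{\theta = \theta^{\ast}}
  \;=\; Dg(\theta^{\ast})^{\top}\, \nabla^2 L(w^{\ast})\, Dg(\theta^{\ast}),
```

so by choosing the Jacobian Dg(θ*) you can scale the nonzero eigenvalues of the Hessian up or down at will, while every parameter setting still represents the same function as before.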
I’ve been thinking about the flatness results from a compression perspective. Being at a flat minimum reduces your description length since you only have to specify the coordinates of the basin you’re in, and flatter minima correspond to wider basins which can be specified with lower-precision coordinates. Re-parameterizing breaks this by introducing extra complexity into the parameter-function map. The ability to always obtain sharper but not flatter minima is kind of analogous to how, given a program, there are infinitely many longer programs replicating its behavior, but only finitely many (possibly none) shorter ones.
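A toy illustration of the precision point, with a made-up 1-D “loss” rather than anything from the paper: at a flat minimum a coarse grid, i.e. a short description, already pins the loss down near its optimum, while a sharp minimum needs a much finer grid.

```python
import numpy as np

# A made-up 1-D picture (not from the paper): two quadratic basins with the
# same minimum value but very different curvature, i.e. flatness.
w_min = 0.137                                  # arbitrary location of the minimum
flat  = lambda w: 0.01  * (w - w_min) ** 2     # wide basin
sharp = lambda w: 100.0 * (w - w_min) ** 2     # narrow basin

def quantize(w, step):
    """Round w to a grid of spacing `step`: a coarser grid = a shorter description."""
    return np.round(w / step) * step

for step in (0.25, 0.05, 0.01):
    wq = quantize(w_min, step)
    print(f"grid {step:<5} flat cost {flat(wq):.1e}   sharp cost {sharp(wq):.1e}")
```

The quantization cost is orders of magnitude larger in the sharp basin at every grid spacing (here the ratio is exactly the curvature ratio, 10^4), which is the sense in which the flat minimum has the shorter description.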
Vacuous, sure, but still true, and it seems relevant to me. You initially wrote:
Regarding the ‘sharp minima can generalize’ paper, they show that there exist sharp minima with good generalization, not flat minima with poor generalization, so they don’t rule out flatness as an explanation for the success of SGD.
But, allowing reparametrisation, this seems false? I don’t understand the step in your argument where you ‘rule out reparametrisation’, nor do I really understand what this would mean.
Your comment relating description length to flatness seems nice. To talk about flatness at all (in the definitions I have seen) requires a choice of parametrisation. And I guess your coordinate description is also using a fixed parametrisation, so this seems reasonable. A change of parametrisation will then change both flatness and description length (i.e. ‘required coordinate precision’).
When I said they didn’t produce flat minima with poor generalization, I meant “they didn’t produce flat minima with poor generalization in the normal parameter space of a neural network”. This is what is relevant to the “flatness as an explanation for generalization” hypothesis, since that hypothesis is about how flatness correlates with generalization in neural networks in practice. That there exist other parameter-function mappings where this is not the case does not refute this. Of course, if you wished to produce a mathematical proof that flat minima generalize this would be an important example to keep in mind—but this post was about high-level scientific hypotheses about neural network generalization, not mathematical proofs. In that context I think it’s correct to say that the paper does not provide a meaningful counterexample.
For the more substantive results in section 4, I do believe the direction is always flat --> sharp.
I agree with this (with ‘sharp’ replaced by ‘generalise’, as I think you intend). It seems to me potentially interesting to ask whether this is necessarily the case.
When I said “the direction is always flat --> sharp”, I meant that their theorems showed you could produce a sharp minimum given a flat one, but not the other way around; sorry if I was unclear.
Definitely agreed that “under which conditions does flatness imply generalization” is a very interesting question. I think this paper has a reasonably satisfying analysis, although I also have some reservations about the “SGD as Bayesian sampler” picture.