I’m confused by the setup. Let’s consider the simplest case: fitting points in the plane, y as a function of x. If I have three datapoints and I fit a quadratic to them, I have a dimension-0 space of minimizers of the loss function: the unique parabola through those three points (assume they’re not on top of each other). Since I have three parameters in a quadratic, I assume this means the effective degrees of freedom of the model is 3, according to this post. If I instead fit a quartic, I now have a dimension-2 space of minimizers and 5 parameters, so I think you’re saying the degrees of freedom is still 3. And so the DoF would be 3 for all polynomial models of degree 2 and above. But I certainly think we expect quadratic models to generalize better than 19th-degree polynomials when fit to just three points.
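To make the counting concrete, here’s a quick numpy check (my own sketch, not from the post): for k generic points, the degree-d polynomials interpolating them form an affine space of dimension (d+1) − k, so “number of parameters minus dimension of the minimizer set” always comes out to k.

```python
# Sketch (my own check): the minimizers of squared-error loss for a degree-d
# polynomial fit to k generic points form an affine space of dim (d+1) - k,
# so the "effective DoF" (params minus minimizer dimension) is always k.
import numpy as np

x = np.array([0.0, 1.0, 2.0])   # three generic x-values
for d in [2, 4, 19]:
    V = np.vander(x, d + 1)                       # k x (d+1) design matrix
    dim_min = (d + 1) - np.linalg.matrix_rank(V)  # null space = flat directions
    print(f"degree {d}: {d + 1} params, minimizer set dim {dim_min}, "
          f"effective DoF {(d + 1) - dim_min}")
```

For degrees 2, 4, and 19 this prints effective DoF 3 every time, which is exactly the puzzle.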
On its own, the quartic has 4 degrees of freedom (and the 19th-degree polynomial 19 DoFs).
It’s not until I introduce additional constraints (independent equations) that the effective dimensionality goes down. E.g., a quartic plus one linear equation gives 3 degrees of freedom:
$ax^4+bx^3+cx^2+dx+e=0, \quad a=b.$
It’s these kinds of constraints/relations/symmetries that reduce the effective dimensionality.
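One reading that makes these counts consistent (my own gloss, not necessarily the commenter’s): treat the quartic as the *equation* above, whose solution set is unchanged when all five coefficients are rescaled, so it has 5 − 1 = 4 DoF, and each independent linear relation like $a=b$ removes one more.

```python
# Sketch (my own gloss on the DoF convention here): a degree-d polynomial
# equation has d+1 coefficients, but its solution set is invariant under
# rescaling them all, so DoF = (d+1) - 1; each independent linear relation
# on the coefficients then removes one more.
import numpy as np

d = 4                                                 # the quartic
relations = np.array([[1.0, -1.0, 0.0, 0.0, 0.0]])    # the relation a - b = 0
dof = ((d + 1) - 1) - np.linalg.matrix_rank(relations)
print(dof)   # 3, matching "a quartic + a linear equation = 3 DoF"
```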
This video has a good example of a more realistic case.
I think the objection to this example is that the relevant function to minimize is not the loss on the training data but something else? The loss it would have on ‘real data’? That seems to make more sense of the post to me, but if that were the case, then I think any minimizer of that function would be equally good at generalizing, by definition. Another candidate would be the parameter-function map you describe, which seems to be the relevant map whose singularities we are studying, but it’s not well defined to ask for minima (or level sets) of that at all. So I don’t think that’s right either.
We don’t have access to the “true loss.” We only have access to the training loss (in this case, $K_n(w)$). Of course the true distribution sneakily sits behind the empirical distribution and so leaves after-effects in the training loss, but it doesn’t show up explicitly in $p(D_n)$ (the thing we’re trying to maximize).
Thanks for trying to walk me through this more, though I’m not sure this clears up my confusion. An even more similar model to the one in the video (a pendulum) would be $y=(a+b)x^2+cx+d$, which has four parameters $a,b,c,d$, but of course you don’t really need both $a$ and $b$. My point is that, as far as the loss function is concerned, the situation for a fourth-degree polynomial’s redundancy is identical to the situation for this new model. Yet we clearly have two different types of redundancy going on:
Type A: like the fourth-degree polynomial’s redundancy, which impairs generalizability since it is merely an artifact of the limited training data, and
Type B: like the new model’s redundancy, which does not impair generalizability compared to some non-redundant version of it, since it is a redundancy in all outputs.
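To make the distinction concrete, here’s a toy numpy illustration (my own, not from the thread): a type-B direction changes the parameters without changing the function anywhere, while a type-A direction preserves the fit on the training points but changes the function elsewhere.

```python
# Sketch (my own illustration of the type A / type B distinction).
import numpy as np

xs = np.linspace(-2.0, 2.0, 9)            # probe points, incl. off-training x

# Type B: y = (a+b)x^2 + cx + d. Shifting (a, b) -> (a+t, b-t) changes the
# parameters but not the function anywhere, so generalization can't change.
def f_B(a, b, c, d, x): return (a + b) * x**2 + c * x + d
print(np.allclose(f_B(1, 2, 0, 0, xs), f_B(3, 0, 0, 0, xs)))    # True

# Type A: a family of quartics that all interpolate the training points
# (0,0), (1,0), (2,0) but disagree elsewhere, so they generalize differently.
def f_A(e, x): return e * x * (x - 1) * (x - 2) * (x - 3)
x_train = np.array([0.0, 1.0, 2.0])
print(np.allclose(f_A(0.0, x_train), f_A(1.0, x_train)))        # True on train
print(np.allclose(f_A(0.0, xs), f_A(1.0, xs)))                  # False off train
```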
Moreover, my intuition is that a highly over-parametrized neural net has much more Type A redundancy than Type B. Is this intuition wrong? That seems perhaps like the very definition of “over-parametrized”: a model with a lot of Type A redundancy. But maybe I am instead wrong to be looking at the loss function in the first place?
I’m confused now too. Let’s see if I got it right:
A: You have two models with perfect train loss but different test loss. They’re interchangeable as far as train loss is concerned, but they may have different generalization performance.
B: You have two models whose layers are permutations of each other and so perform the exact same calculation (and therefore have the same generalization performance).
The claim is that the “simplest” models (largest singularities) dominate our expected learning behavior. Large singularities mean fewer effective parameters. The reason that simplicity (with respect to either type) translates to generalization is Occam’s razor: simple functions are compatible with more possible continuations of the dataset.
Not all type A redundant models are the same with respect to simplicity and therefore they’re not treated the same by learning.
I’m still thinking about this (unsuccessfully). Maybe my missing piece is that the examples I’m considering here still do not have any of the singularities this topic focuses on! What are the simplest examples with singularities? Say again we’re fitting $y=f(x)$ over some parameters, and specifically let’s take the points $(0,0)$ and $(1,0)$ as our only training data. Then $f_1(x)=ab+cx$ has minimal-loss set $\{(a=0 \text{ or } b=0) \text{ and } c=0\}$, which has a singularity at $(0,0,0)$. I don’t really see why it would generalize better than $f_2(x)=a+cx$ or $f_3(x)=a+b+cx$, neither of which has singularities in its minimal-loss set. These are still only examples of the type B behavior, where the model is already effectively just two parameters, so maybe there’s no further improvement for a singularity to give?
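A quick check of that zero set (my own sketch, assuming squared-error loss): on those two points the loss is $L(a,b,c)=(ab)^2+(ab+c)^2$, so the minimal-loss set $\{ab=0,\ c=0\}$ is two lines crossing at the origin, and the loss is flatter than quadratic along the $(a,b)$ directions there.

```python
# Sketch (my own check): f1(x) = a*b + c*x on the data {(0,0), (1,0)} gives
# squared-error loss L(a,b,c) = (a*b)^2 + (a*b + c)^2, whose zero set
# {a*b = 0 and c = 0} is two lines crossing at the origin.
import numpy as np

def loss(a, b, c):
    preds = a * b + c * np.array([0.0, 1.0])   # model outputs at x = 0 and 1
    return float(np.sum(preds**2))             # both targets are 0

print(loss(0.7, 0.0, 0.0), loss(0.0, -1.3, 0.0))   # 0.0 0.0: both branches

# At the crossing the loss is degenerate in the (a, b) directions: stepping
# eps along the diagonal costs only O(eps^4), not the usual O(eps^2).
eps = 1e-2
print(loss(eps, eps, 0.0))   # 2e-08, i.e. 2 * eps**4
```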
Consider instead $f_4(x)=a+bx+cdx^2$. Here the minimal-loss set has a singularity at $(0,0,0,0)$. But maybe now, if we’re at that point, the model has effectively reduced down to $f_4(x)=a+bx+0$, since perturbing either $c$ or $d$ away from zero would still keep the last term zero. So maybe this is a case where $f_4$ has type A behavior in general (since the $x^2$ term can throw off generalizability compared to a linear model) but approximates type B behavior near the singularity (since the $x^2$ term stays negligible even when perturbed)? That’s the best picture of this argument I’ve been able to convince myself of so far! Singularities are (sometimes) points where type A behavior becomes type B behavior.
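Here’s a numeric version of that picture (my own sketch, same two training points as above):

```python
# Sketch (my own check): near the singular minimum (a,b,c,d) = (0,0,0,0) of
# f4(x) = a + b*x + c*d*x^2, single-coordinate perturbations of c or d leave
# the function unchanged everywhere; only a joint perturbation revives the
# x^2 term, and then only at second order.
import numpy as np

def f4(a, b, c, d, x): return a + b * x + c * d * x**2

xs = np.linspace(-2.0, 2.0, 9)
eps = 0.1
print(np.allclose(f4(0, 0, eps, 0, xs), 0.0))   # True: c alone does nothing
print(np.allclose(f4(0, 0, 0, eps, xs), 0.0))   # True: d alone does nothing
print(np.max(np.abs(f4(0, 0, eps, eps, xs))))   # 0.04 = eps**2 * max(x**2)
```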
I wrote a follow-up that should be helpful for seeing an example in more detail. The example I mention is the loss function (= potential energy) $L(x,y)=a\cdot\min((x-b)^2,(y-b)^2)$. There’s a singularity at the origin.
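For what it’s worth, here’s a small numeric sketch (assuming $a=1$, $b=0$, as in the plots) of what makes that crossing special: the zero set is the union of the two coordinate axes, and a box around the singular origin contains about twice as much near-zero-loss volume as an equal box around a smooth point of the zero set.

```python
# Sketch (assuming a = 1, b = 0): L(x, y) = min(x^2, y^2) vanishes on the
# union of the two coordinate axes, which crosses itself at the origin.
import numpy as np

def L(x, y, a=1.0, b=0.0):
    return a * np.minimum((x - b)**2, (y - b)**2)

t = np.linspace(-1.0, 1.0, 5)
print(np.allclose(L(t, 0 * t), 0.0), np.allclose(L(0 * t, t), 0.0))  # True True

# The singular crossing holds more low-loss volume than a smooth stretch of
# the zero set: compare equal boxes around (0, 0) and around (1, 0).
g = np.linspace(-0.5, 0.5, 201)
X, Y = np.meshgrid(g, g)
print(np.mean(L(X, Y) < 1e-3))          # ~0.12: two valleys cross here
print(np.mean(L(X + 1.0, Y) < 1e-3))    # ~0.06: a single valley
```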
This does seem like an important point to emphasize: symmetries in the model $p(\cdot \mid w)$ (or $f_w(\cdot)$, if you’re making deterministic predictions) and the true distribution $q(x)$ lead to singularities in the loss landscape $L_n(w)$. There’s an important distinction between $f$ and $L$.
So that example is of $L$; what is the $f$ for it? Obviously, there are multiple $f$ that could give rise to it (depending on how the loss is computed from $f$), some of them having symmetries and some not. That’s why I find the discussion so confusing: we really only care about symmetries of $f$ (which give type B behavior), but instead we’re talking about symmetries of $L$ (which may indicate either type A or type B) without really distinguishing the two. (Unless my example in the previous comment shows that it’s a false dichotomy and type A can simulate type B at a singularity.)
I’m also not sure the example matches the plots you’ve drawn: presumably the parameters of the model are $a,b$, but the plots show it varying $x,y$ for fixed $a=1$, $b=0$? Treating it as written, there’s not actually a singularity in its parameters $a,b$.
This is a toy example (I didn’t come up with it with any particular $f$ in mind).
I think the important thing is that the distinction doesn’t make much of a difference in practice. Both correspond to lower effective dimensionality (type A very explicitly, and type B less directly). Both are able to “trap” random motion. And it seems like both somehow help make the loss landscape more navigable.
If you’re interested in interpreting the energy landscape as a loss landscape, $x$ and $y$ would be the parameters (and $a$ and $b$ would be hyperparameters related to things like the learning rate and batch size).