Thanks for trying to walk me through this more, though I'm not sure this clears up my confusion. An even closer analogue to the model in the video (a pendulum) would be the model y=(a+b)x^2+cx+d, which has four parameters a,b,c,d, though of course you don't really need both a and b. My point is that, as far as the loss function is concerned, the situation for a fourth-degree polynomial's redundancy is identical to the situation for this new model. Yet we clearly have two different types of redundancy going on:
Type A: like the fourth-degree polynomial's redundancy, which impairs generalizability since it is merely an artifact of the limited training data, and
Type B: like the new model's redundancy, which does not impair generalizability compared to some non-redundant version of it, since it is a redundancy across all outputs.
Moreover, my intuition is that a highly over-parametrized neural net has much more Type A redundancy than Type B. Is this intuition wrong? That seems like it could be the very definition of "over-parametrized": a model with a lot of Type A redundancy. But maybe I'm instead wrong to be looking at the loss function in the first place?
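To make the two types concrete, here is a minimal numerical sketch of my own (the names `model_B` and `poly4` are illustrative, not from the discussion): the (a+b) redundancy is invisible in every output, while two fourth-degree interpolants of the same two training points disagree away from the data.

```python
import numpy as np

# Type B: y = (a+b)x^2 + c*x + d. Any (a, b) with the same sum a+b
# defines the *same function*, so the redundancy shows up in no output at all.
def model_B(x, a, b, c, d):
    return (a + b) * x**2 + c * x + d

x = np.linspace(-2, 2, 9)
out1 = model_B(x, a=1.0, b=2.0, c=0.5, d=0.1)  # a + b = 3
out2 = model_B(x, a=3.0, b=0.0, c=0.5, d=0.1)  # a + b = 3
assert np.allclose(out1, out2)                 # identical everywhere

# Type A: a fourth-degree polynomial through two training points.
# Many coefficient vectors reach zero train loss yet disagree off the data.
train_x = np.array([0.0, 1.0])
train_y = np.array([0.0, 0.0])

def poly4(x, coeffs):
    return np.polyval(coeffs, x)  # coefficients ordered highest degree first

c1 = np.zeros(5)                           # the zero polynomial
c2 = np.array([1.0, -2.0, 1.0, 0.0, 0.0])  # x^2 * (x - 1)^2
assert np.allclose(poly4(train_x, c1), train_y)
assert np.allclose(poly4(train_x, c2), train_y)        # both interpolate the data...
assert not np.isclose(poly4(0.5, c1), poly4(0.5, c2))  # ...but differ between the points
```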
I’m confused now too. Let’s see if I got it right:
A: You have two models with perfect train loss but different test loss. They are interchangeable with respect to train loss, but they may have different generalization performance.
B: You have two models whose layers are permutations of each other and so perform the exact same calculation (and therefore have the same generalization performance).
The claim is that the “simplest” models (largest singularities) dominate our expected learning behavior. Large singularities mean fewer effective parameters. The reason that simplicity (with respect to either type) translates to generalization is Occam’s razor: simple functions are compatible with more possible continuations of the dataset.
Not all type A redundant models are the same with respect to simplicity and therefore they’re not treated the same by learning.
I’m still thinking about this (unsuccessfully). Maybe my missing piece is that the examples I’m considering here still do not have any of the singularities that this topic focuses on! What are the simplest examples with singularities? Say again we’re fitting y = f(x) over some parameters, and specifically let’s take the points (0,0) and (1,0) as our only training data. Then f1(x)=ab+cx has minimal loss set {(a=0 or b=0) and c=0}. That has a singularity at (0,0,0). I don’t really see why it would generalize better than f2(x)=a+cx or f3(x)=a+b+cx, neither of which has a singularity in its minimal loss set. These are still only examples of the type B behavior, where they already are effectively just two parameters, so maybe there’s no further improvement for a singularity to give?
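One way to see the singularity of f1's minimal loss set concretely is through the Hessian of the train loss: at the crossing point (0,0,0) it has fewer non-flat directions than at a generic minimal-loss point. A quick finite-difference check (my own sketch, not from the discussion):

```python
import numpy as np

# Squared-error train loss for f1(x) = a*b + c*x on the points (0,0), (1,0):
#   L(a, b, c) = (a*b)^2 + (a*b + c)^2
def loss(w):
    a, b, c = w
    return (a * b) ** 2 + (a * b + c) ** 2

def hessian(f, w, eps=1e-4):
    """Finite-difference Hessian of f at the point w."""
    w = np.asarray(w, dtype=float)
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = eps
            e_j = np.zeros(n); e_j[j] = eps
            H[i, j] = (f(w + e_i + e_j) - f(w + e_i - e_j)
                       - f(w - e_i + e_j) + f(w - e_i - e_j)) / (4 * eps**2)
    return H

# Hessian rank at the singular crossing (0,0,0) vs. a generic
# minimal-loss point (0,1,0) on the branch {a=0, c=0}:
rank_singular = np.linalg.matrix_rank(hessian(loss, [0, 0, 0]), tol=1e-6)
rank_generic = np.linalg.matrix_rank(hessian(loss, [0, 1, 0]), tol=1e-6)
print(rank_singular, rank_generic)  # → 1 2
```

So the crossing point really is "flatter": only one curved direction instead of two, which is the local signature of the singularity.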
Consider instead f4(x)=a+bx+cdx^2. Here the minimal loss set has a singularity at (0,0,0,0). But maybe now, if we’re at that point, the model has effectively reduced down to f4(x)=a+bx+0, since perturbing either c or d away from zero would still keep the last term zero. So maybe this is a case where f4 has type A behavior in general (since the x^2 term can throw off generalizability compared to a linear model) but approximates type B behavior near the singularity (since the x^2 term stays negligible even when perturbed)? That seems to be the best picture of this argument that I’ve been able to convince myself of so far! Singularities are (sometimes) points where type A behavior becomes type B behavior.
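A quick numerical check of that perturbation claim for f4 (again my own sketch): at the singular point, moving c or d alone leaves the function unchanged, and a joint perturbation only wakes the quadratic term up at second order.

```python
import numpy as np

def f4(x, a, b, c, d):
    return a + b * x + c * d * x**2

x = np.linspace(-2, 2, 9)
base = f4(x, 0, 0, 0, 0)

# Perturbing c (or d) alone away from the singular point (0,0,0,0):
# the product c*d stays 0, so the x^2 term -- and the function -- don't move.
assert np.allclose(f4(x, 0, 0, 0.1, 0), base)
assert np.allclose(f4(x, 0, 0, 0, 0.1), base)

# Only a *joint* perturbation of c and d activates the quadratic term,
# and it does so at second order (0.1 * 0.1 = 0.01):
delta = f4(x, 0, 0, 0.1, 0.1) - base
print(np.max(np.abs(delta)))  # ~0.04, reached at x = ±2
```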
I wrote a follow-up that should be helpful to see an example in more detail. The example I mention is the loss function (= potential energy) L(x,y)=a⋅min((x−b)^2,(y−b)^2). There’s a singularity at the crossing point (b,b), i.e. at the origin for b=0.
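For concreteness, a small sanity check of that example's geometry (my own sketch, taking a=1 and b=0): the zero set is the union of the two lines x=0 and y=0, and the crossing point is "flatter" than any other zero.

```python
import numpy as np

# The toy potential with its parameters kept explicit:
#   L(x, y) = a * min((x - b)^2, (y - b)^2)
def L(x, y, a=1.0, b=0.0):
    return a * np.minimum((x - b) ** 2, (y - b) ** 2)

# For a = 1, b = 0 the zero set is the union of the lines x = 0 and y = 0:
xs = np.linspace(-1, 1, 5)
assert np.allclose(L(np.zeros_like(xs), xs), 0)  # the line x = 0
assert np.allclose(L(xs, np.zeros_like(xs)), 0)  # the line y = 0

# The singularity sits where the lines cross. At (0,0) *both* axis
# directions are flat, while at a generic zero like (0,1) only one is:
eps = 1e-3
assert L(eps, 0.0) == 0 and L(0.0, eps) == 0  # two flat directions at the crossing
assert L(0.0, 1.0 + eps) == 0                 # one flat direction at (0,1)...
assert L(eps, 1.0) > 0                        # ...while the other curves up
```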
This does seem like an important point to emphasize: symmetries in the model p(⋅|w) (or fw(⋅) if you’re making deterministic predictions) and the true distribution q(x) lead to singularities in the loss landscape Ln(w). There’s an important distinction between f and L.
So that example is of L; what is the f for it? Obviously, there are multiple f that could give that (depending on how the loss is computed from f), with some of them having symmetries and some of them not. That’s why I find the discussion so confusing: we really only care about symmetries of f (which give type B behavior), but instead we’re talking about symmetries of L (which may indicate either type A or type B) without really distinguishing the two. (Unless my example in the previous post shows that it’s a false dichotomy and type A can simulate type B at a singularity.)
I’m also not sure the example matches the plots you’ve drawn: presumably the parameters of the model are a,b, but the plots show it varying x,y for fixed a=1,b=0? Treating it as written, there’s not actually a singularity in its parameters a,b.
This is a toy example (I didn’t come up with it with any particular f in mind).
I think the important thing is that the distinction does not make much of a difference in practice. Both correspond to lower effective dimensionality (type A very explicitly, type B less directly). Both are able to “trap” random motion. And it seems like both somehow help make the loss landscape more navigable.
If you’re interested in interpreting the energy landscape as a loss landscape, x and y would be the parameters (and a and b would be hyperparameters related to things like the learning rate and batch size).
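Treating x and y as the parameters, the "trapping" claim can be given a rough Monte Carlo sanity check (my own sketch, not from the discussion, with a=1 and b=0): the crossing point of min(x^2, y^2) is surrounded by more low-loss volume than a generic point on either zero line, which is one reason random motion tends to linger there.

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_low_loss(center, radius=0.05, eps=1e-4, n=100_000):
    """Monte Carlo estimate of how much of a small box around `center`
    has loss below eps, for the potential L(x, y) = min(x^2, y^2)."""
    pts = center + radius * rng.uniform(-1, 1, size=(n, 2))
    losses = np.minimum(pts[:, 0] ** 2, pts[:, 1] ** 2)
    return float(np.mean(losses < eps))

frac_at_crossing = frac_low_loss(np.array([0.0, 0.0]))  # the singular point
frac_at_generic = frac_low_loss(np.array([0.0, 1.0]))   # a generic zero of L
# Analytically these should come out near 0.36 vs 0.20 for the settings
# above: the crossing has noticeably more low-loss neighborhood.
print(frac_at_crossing, frac_at_generic)
```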