Here’s a concrete toy example where SLT and this post give different answers (SLT is more specific). Let f(θ)(x) = θ₁θ₂x, and let L(f(θ)) = |f(θ)(1)|² = θ₁²θ₂². Then the minimal loss is achieved on the set of parameters where θ₁ = 0 or θ₂ = 0 (note that this looks like two intersecting lines, with the singularity being the intersection). Note that all θ in that set also give the same exact f(θ). The theory in your post here doesn’t say much beyond the standard point that gradient descent will (likely) select a minimal or near-minimal θ, but it can’t distinguish between the different values of θ within that minimal set.
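A minimal sketch of this setup in code (my own illustration of the example above, not from either post):

```python
# Toy model: f(theta)(x) = theta1 * theta2 * x, with loss
# L(f(theta)) = |f(theta)(1)|^2 = theta1^2 * theta2^2.
def loss(theta1, theta2):
    return (theta1 * theta2) ** 2

# The minimal loss set {theta1 = 0} U {theta2 = 0} is two lines crossing
# at the origin, and every point on it realizes the same function f = 0.
assert loss(0.0, 3.0) == 0.0   # on the minimal set (theta1 = 0)
assert loss(2.0, 0.0) == 0.0   # on the minimal set (theta2 = 0)
assert loss(0.5, 0.5) > 0.0    # off the minimal set
```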
SLT, on the other hand, says that gradient descent will be more likely to choose the specific singular point θ₁ = θ₂ = 0.
Now I’m not sure this example is realistic enough to demonstrate why you would care about SLT’s extra specificity, since in this case I’m perfectly happy with any value of θ in the minimal set—they all give the exact same f(θ). If I were to try to generalize this into a useful example, I would look for a case where L(f(θ)) has a minimal set containing multiple different f(θ). For example, L only evaluates f(θ) on a subset of points (the ‘training data’), but different choices of minimal θ give different values outside that subset. Then we can ask which f(θ) generalizes best beyond the training data—do the parameters predicted by SLT yield the f(θ) that are best at generalizing?
Disclaimer: I have a very rudimentary understanding of SLT and may be misrepresenting it.
I don’t think this representation of the theory in my post is correct. The effective dimension of the singularity at the origin is much higher than that of the other minimal points: near every other minimal point of this loss function the Hessian doesn’t vanish, while at the origin it does. If you discretized this setup on a lattice of mesh ε, say, you would notice that the origin is surrounded by many parameters that give nearly identical loss, while near other parts of the minimal set the number of such parameters is far fewer.
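Here’s a quick numerical illustration of this (a sketch; the mesh ε and the loss threshold δ are arbitrary choices of mine):

```python
import numpy as np

# Count lattice points (mesh eps) in a window of radius r around a center
# where the loss (t1 * t2)^2 falls below a small threshold delta.
def near_optimal_count(center, eps=1e-3, r=0.1, delta=1e-8):
    ks = np.arange(-r, r + eps, eps)
    t1, t2 = np.meshgrid(center[0] + ks, center[1] + ks)
    return int(((t1 * t2) ** 2 < delta).sum())

# Near (0, 1) the constraint pins t1 to a strip of width ~sqrt(delta), so
# essentially only the t1 = 0 column qualifies; near the origin a much
# fatter region between the two axes does.
print(near_optimal_count((0.0, 0.0)))  # thousands of near-optimal points
print(near_optimal_count((0.0, 1.0)))  # roughly one column of them
```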
The reason you have to do some kind of “translation” between the two theories is that SLT can see not just exactly optimal points but also nearly optimal points, and bad singularities are surrounded by many more nearly optimal points than better-behaved singularities. You can interpret the discretized picture above as the SLT picture seen at some “resolution” or “scale” ε: discretizing the loss function on a lattice of mesh ε recovers my picture. Of course, this loses the information of what happens as ε → 0 and n → ∞ in some thermodynamic limit, which is what you recover when you do SLT.
I just don’t see what this thermodynamic limit tells you about the learning behavior of NNs that we didn’t know before. We already know NNs approximate Solomonoff induction if the A-complexity is a good approximation to Kolmogorov complexity and so forth. What additional information is gained by knowing what A looks like as a smooth function as opposed to a discrete function?
In addition, the strong dependence of SLT on A being analytic is bad, because analytic functions are rigid: their values on a small open subset determine their values globally. I can see why you need this assumption, since quantifying what happens near a singularity becomes incredibly difficult for general smooth functions, but because of this rigidity the approximation that “we can just pretend NNs are analytic” is more pernicious than e.g. “we can just pretend NNs are smooth”. Typical approximation theorems like Stone-Weierstrass also fail to save you, because they only work in the sup norm, and that’s completely useless for determining behavior at singularities. So I have yet to be convinced that the additional details in SLT provide a more useful account of NN learning than my simple description above.
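To spell out the rigidity contrast with a standard example: the bump-type function below is smooth but not analytic, while an analytic function that vanishes on any open interval must vanish everywhere.

```latex
% Smooth but not analytic at 0: every derivative vanishes there, so its
% Taylor series at 0 is identically zero even though f is not.
\[
  f(x) =
  \begin{cases}
    e^{-1/x^2} & x \neq 0,\\
    0          & x = 0.
  \end{cases}
\]
% Smooth functions can be locally flat like this with no global
% consequences; analytic functions cannot, which is why "pretend it's
% analytic" is a much stronger assumption than "pretend it's smooth".
```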
The effective dimension of the singularity at the origin is much higher than that of the other minimal points: near every other minimal point of this loss function the Hessian doesn’t vanish, while at the origin it does. If you discretized this setup on a lattice of mesh ε, say, you would notice that the origin is surrounded by many parameters that give nearly identical loss, while near other parts of the minimal set the number of such parameters is far fewer.
As I read it, the arguments you make in the original post depend only on the macrostate f, which is the same for both the singular and non-singular points of the minimal loss set (in my example), so they can’t distinguish these points at all. I see that you’re also applying the logic to points near the minimal set and arguing that nearly-optimal points are more abundant near the singularities than near the non-singularities. I think that’s a significant point, not made at all in your original post, that brings it closer to SLT, so I’d encourage you to add it to the post.
I think there’s also a terminology mismatch between your post and SLT. You refer to singularities of A (i.e., points where its derivative is degenerate), while SLT refers to singularities of the set of minimal loss parameters. The point θ = (0,1) in my example is not singular at all in SLT, but it is a singular point of A. This terminology collision makes it sound like you’ve recreated SLT more than you actually have.
I’m not too sure how to respond to this comment because it seems like you’re not understanding what I’m trying to say.
I agree there’s some terminology mismatch, but this is inevitable because SLT is a continuous model and my model is discrete. If you want to translate between them, you need to imagine discretizing SLT, which means you discretize both the codomain of the neural network and the space of functions you’re trying to represent in some suitable way. If you do this, then you’ll notice that the worse a singularity is, the lower the A-complexity of the corresponding discrete function will turn out to be, because many of the neighbors map to the same function after discretization.
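A sketch of the discretized picture I have in mind (the scales η and ε are arbitrary choices of mine):

```python
import numpy as np

# Discretize function space: A(theta) is identified with g_k when
# theta1 * theta2 rounds to k * eta. Count lattice neighbours of a point
# that map to the same discrete function as the point itself.
def same_function_neighbours(center, eta=1e-4, eps=1e-3, r=0.05):
    ks = np.arange(-r, r + eps, eps)
    t1, t2 = np.meshgrid(center[0] + ks, center[1] + ks)
    k0 = round(center[0] * center[1] / eta)
    return int((np.round(t1 * t2 / eta) == k0).sum())

print(same_function_neighbours((0.0, 0.0)))  # worse singularity: big preimage
print(same_function_neighbours((0.0, 1.0)))  # milder point: thin preimage
# More neighbours coding for the same discrete function means that
# function has a larger preimage, i.e. a lower A-complexity.
```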
The content that SLT adds on top of this is what happens in the limit where your discretization becomes infinitely fine and your dataset becomes infinitely large, but your model doesn’t become infinitely large. In this case, SLT claims that the worst singularities dominate the equilibrium behavior of SGD, which I agree is an accurate claim. However, I’m not sure what this claim is supposed to tell us about how NNs learn. I can’t make any novel predictions about NNs with this knowledge that I couldn’t before.
In this case, SLT claims that the worst singularities dominate the equilibrium behavior of SGD, which I agree is an accurate claim. However, I’m not sure what this claim is supposed to tell us about how NNs learn
I think the implied claim is something like “analyzing the singularities of the model will also be helpful for understanding SGD in more realistic settings” or maybe just “investigating this area further will lead to insights which are applicable in more realistic settings”. I mostly don’t buy it myself.
the worse a singularity is, the lower the A-complexity of the corresponding discrete function will turn out to be
This is where we diverge. Please let me know where you think my error is in the following. Returning to my explicit example (I wrote f(θ) originally, but I’ll use A(θ) here since that matches your definitions):
1. Let f₀(x) = 0·x be the constant zero function and S = A⁻¹(f₀).
2. Observe that S is the minimal loss set under our loss function, and that S is the set of parameters θ = (θ₁, θ₂) where θ₁ = 0 or θ₂ = 0.
3. Let α, β ∈ S. Then A(α) = f₀ = A(β) by definition of S. Therefore, c(A(α)) = c(A(β)).
4. SLT says that θ = (0,0) is a singularity of S but that θ = (0,1) ∈ S is not a singularity.
5. Therefore, there exists a singularity (according to SLT) with A-complexity (and loss) identical to that of a non-singular point, contradicting the statement I quoted.
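Steps 1-3 are easy to check numerically; here’s a small sketch (the grid of evaluation points is an arbitrary choice of mine):

```python
import numpy as np

# Every theta on the two axes realizes the same function A(theta) = f0,
# i.e. the same macrostate, hence the same A-complexity.
xs = np.linspace(-1.0, 1.0, 11)   # evaluation points for A(theta)

def A(t1, t2):
    return t1 * t2 * xs

for theta in [(0.0, 0.0), (0.0, 1.0), (2.5, 0.0)]:
    assert np.array_equal(A(*theta), np.zeros_like(xs))
# Identical outputs, identical macrostate, identical c(A(theta)):
# exactly what step 5 leans on.
```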
You need to discretize the function before taking preimages. If you just take preimages in the continuous setting, of course you’re not going to see any of the interesting behavior SLT is capturing.
In your case, let’s say that we discretize the function space by choosing which one of the functions gₖ(x) = kηx you’re closest to, for some η > 0. In addition, we also discretize the domain of A by looking at the lattice (εℤ)² for some ε > 0. Now, you’ll notice that there’s a disk of radius ~√η around the origin which contains only parameters mapping to the zero function, and as our lattice has fundamental area ε², this means the “relative weight” of the singularity at the origin is like O(η/ε²).
In contrast, all other points mapping to the zero function only get a relative weight of O(η/(kε²)), where kε is the absolute value of their nonzero coordinate. Cutting off the domain somewhere to make it compact and summing over all kε > √η to exclude the disk at the origin gives O(√η/ε) for the total contribution of all the other points in the minimum loss set. So in the limit ε²/η → 0, the singularity at the origin accounts for almost everything in the preimage A⁻¹(f₀). The origin is privileged in my picture just as it is in the SLT picture.
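A rough numerical check of these counts (a sketch with hypothetical scales η and ε, chosen so that ε² ≪ η):

```python
import numpy as np

eta, eps = 4e-3, 1e-3   # hypothetical scales with eps**2 << eta

# Disk of radius ~sqrt(eta) around the origin: every point there has
# |t1 * t2| <= (t1^2 + t2^2) / 2 < eta / 2, so the whole disk maps to the
# zero function: roughly eta / eps^2 lattice points, up to constants.
r = np.sqrt(eta)
ks = np.arange(-r, r + eps, eps)
t1, t2 = np.meshgrid(ks, ks)
print(int((t1**2 + t2**2 < eta).sum()), "~ eta/eps^2 =", eta / eps**2)

# One lattice row through a non-singular minimum (0, c): only the strip
# |t1| < eta / (2 * c) maps to the zero function, ~eta / (c * eps) points.
c = 0.5
row = np.arange(-0.05, 0.05, eps)
print(int((np.abs(row * c) < eta / 2).sum()), "~ eta/(c*eps) =", eta / (c * eps))
```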
I think your mistake is that you’re trying to translate between these two models too literally, when you should be thinking of my model as a discretization of the SLT model. Because it’s a discretization at a particular scale, it doesn’t capture what happens as the scale is changing. That’s the main shortcoming relative to SLT, but it’s not clear to me how important capturing this thermodynamic-like limit is to begin with.
Again, maybe I’m misrepresenting the actual content of SLT here, but it’s not clear to me what SLT says aside from this, so...
Everything I wrote in steps 1-4 was done in a discrete setting (otherwise |A⁻¹(f₀)| is not finite and the whole thing falls apart). I intended θ to be a pair of floating-point numbers and A to map floats to floats.
However, with that framing I think I see what you’re trying to say: θ₁θ₂ will equal zero in some cases where θ₁ and θ₂ are both non-zero but very small, because the product underflows the limits of floating-point numbers. Therefore the preimage A⁻¹(f₀) is actually larger than I claimed, and specifically contains a small neighborhood of (0,0).
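For concreteness, here’s the underflow effect in ordinary floats (a standalone illustration, not tied to any particular framework):

```python
import numpy as np

# Two nonzero doubles whose product underflows to exactly zero:
t1 = t2 = 1e-200
print(t1 != 0.0 and t2 != 0.0 and t1 * t2 == 0.0)   # True

# In float32 the effect appears at much larger magnitudes, so the
# neighborhood of (0, 0) that maps to f0 is correspondingly bigger:
a = np.float32(1e-25)
print(a != 0 and a * a == 0)                        # True
```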
That doesn’t invalidate my calculation showing that (0,0) is exactly as likely as (0,1), though: they still have the same loss and A-complexity (since they have the same macrostate). On the other hand, you’re saying that there are points in parameter space very close to (0,0) that are also in this same preimage and also equally likely. Therefore, even if (0,0) is just as likely as (0,1), being near (0,0) is more likely than being near (0,1). I think it’s fair to say that this is at least qualitatively the same as what SLT gives in the continuous version of this.
However, I do think this result arises from factors that weren’t discussed in your original post, which makes it sound like it is due to A-complexity. A-complexity is a function of the macrostate, which is the same at all of these points, so it does not distinguish between (0,0) and (0,1) at all. In other words, your post tells me which f is likely, while SLT tells me which θ is likely—these are not the same thing. But you clearly have additional ideas, not stated in the post, that also help you figure out which θ is likely. Until that is clarified, I think your mental theory of this is very different from what you wrote.
Sure, I agree that I didn’t put this information into the post. However, why do you need to know which θ is more likely to know anything about e.g. how neural networks generalize?
I understand that SLT has some additional content beyond what is in the post, and I’ve tried to explain how you could make that fit in this framework. I just don’t understand why that additional content is relevant, which is why I left it out.
As an additional note, I wasn’t really talking about floating point precision being the important variable here. I’m just saying that if you want A-complexity to match the notion of real log canonical threshold, you have to discretize SLT in a way that might not be obvious at first glance, and in a way where some conclusions end up being scale-dependent. This is why if you’re interested in studying this question of the relative contribution of singular points to the partition function, SLT is a better setting to be doing it in. At the risk of repeating myself, I just don’t know why you would try to do that.
In my view, there’s a significant philosophical difference between SLT and your post: your post talks only about choosing macrostates, while SLT talks about choosing microstates. I’m much less qualified to know (let alone explain) the benefits of SLT, though I can speculate. If we stop training after a finite number of steps, then I think it’s helpful to know where training is converging to. In my example, if you think it’s converging to (0,1), then stopping close to that will get you a function that doesn’t generalize too well. If you know it’s converging to (0,0), then stopping close to that will get you a much better function—possibly exactly as good, as you pointed out, due to discretization.
Now this logic is basically exactly what you’re saying in these comments! But I think if someone read your post without prior knowledge of SLT, they wouldn’t figure out that it’s more likely to converge to a point near (0,0) than near (0,1). If they read an SLT post instead, they would figure that out. In that sense, SLT is more useful.
I am not confident that that is the intended benefit of SLT according to its proponents, though. And I wouldn’t be surprised if you could write a simpler explanation of this in your framework than SLT gives, I just think that this post wasn’t it.