Can you say more about why it is a measure of posterior concentration?
...
Are you claiming that most of that work happens very localized in a small parameter region?
Given a small neighbourhood $\mathcal{W} \subset W$, the free energy is $F_n(\mathcal{W}) = -\log Z_n(\mathcal{W})$, and $Z_n(\mathcal{W})$ measures the posterior concentration in $\mathcal{W}$ since
$$Z_n(\mathcal{W}) = \int_{\mathcal{W}} e^{-n L_n(w)} \varphi(w) \, dw,$$
where the integrand is the posterior, modulo its normalisation constant $Z_n$. The key here is that if we are comparing different regions $\mathcal{W}$ of parameter space, then the free energy doesn’t care about that normalisation constant, as it is just a shift in $F_n(\mathcal{W})$ by a constant. So the free energy gives you a tool for comparing different regions of the posterior. (To make this comparison rigorous, I suppose one would want to make sure that these regions $\mathcal{W}_1, \mathcal{W}_2$ are the same “size”. Another perspective, and really the main SLT perspective, is that if they are sufficiently small and localised around different singularities then this size problem isn’t really relevant, and the free energy is telling you something about the structure of the singularity and the local geometry of $K(w)$ around the singularity.)
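To see this comparison in numbers, here is a minimal sketch that evaluates local free energies around the two minima of the one-dimensional potential $K(w) = (w+1)^2 (w-1)^4$ quoted later in this thread. Several things here are assumptions made for illustration: the empirical term $n L_n(w)$ is replaced by its population counterpart $n K(w)$, the prior is taken to be uniform on $[-2, 2]$, and the names `local_free_energy` and `growth_rate` are hypothetical helpers, not anything from the posts.

```python
import numpy as np


def K(w):
    """Potential with minima at w = -1 (quadratic) and w = +1 (quartic)."""
    return (w + 1.0) ** 2 * (w - 1.0) ** 4


# Assumption: uniform prior on [-2, 2], so the density is constant.
PRIOR_DENSITY = 0.25


def local_free_energy(n, centre, radius=0.3, num_points=200_001):
    """F_n(W) = -log Z_n(W) for W = [centre - radius, centre + radius].

    Uses n*K(w) in place of n*L_n(w) (an idealisation) and a trapezoidal
    rule for the integral.
    """
    w = np.linspace(centre - radius, centre + radius, num_points)
    integrand = np.exp(-n * K(w)) * PRIOR_DENSITY
    dw = w[1] - w[0]
    z = dw * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
    return -np.log(z)


def growth_rate(centre, n_small=1e3, n_large=1e5):
    """Slope of F_n(W) against log n, an estimate of the local RLCT."""
    df = local_free_energy(n_large, centre) - local_free_energy(n_small, centre)
    return df / (np.log(n_large) - np.log(n_small))


rate_left = growth_rate(-1.0)   # neighbourhood of the quadratic minimum
rate_right = growth_rate(1.0)   # neighbourhood of the quartic minimum

# Positive gap: the region around w = +1 has the lower free energy.
gap = local_free_energy(1e4, -1.0) - local_free_energy(1e4, 1.0)

print(f"slope near w=-1: {rate_left:.3f}, slope near w=+1: {rate_right:.3f}")
print(f"F_n(left) - F_n(right) at n=1e4: {gap:.3f}")
```

Because the prior density and the global normalisation only shift both free energies by the same constant, the sign of `gap` does not depend on them. The slopes come out close to $\tfrac{1}{2}$ and $\tfrac{1}{4}$, matching the local behaviour $K \approx 16(w+1)^2$ and $K \approx 4(w-1)^4$ near the two minima.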
I am not quite sure what you mean with “it tells us something about the information geometry of the posterior”
This is sloppily written by me, apologies. I merely mean to say “the free energy tells us what models the posterior likes”.
$$G_n(\mathcal{W}) = \mathbb{E}_{X_{n+1}}\big[F_{n+1}(\mathcal{W})\big] - F_n(\mathcal{W}).$$
I didn’t find a definition of the left expression.
I mean, the relation between $G_n$ and $F_n$ tells you that this is a sensible thing to write down, and if you reconstructed the left side from the right side you would simply find some definition in terms of the predictive distribution restricted to $\mathcal{W}$ (instead of $W$ in the integral).
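For concreteness, that reconstruction can be sketched as follows. It assumes $L_n(w) = -\frac{1}{n}\sum_{i=1}^{n} \log p(X_i \mid w)$ is the empirical negative log-likelihood, so that $e^{-(n+1) L_{n+1}(w)} = p(X_{n+1} \mid w)\, e^{-n L_n(w)}$. The symbol $p(\cdot \mid D_n, \mathcal{W})$ for the predictive distribution restricted to $\mathcal{W}$ is notation introduced here for illustration.

```latex
\begin{aligned}
F_{n+1}(\mathcal{W}) - F_n(\mathcal{W})
  &= -\log \frac{Z_{n+1}(\mathcal{W})}{Z_n(\mathcal{W})} \\
  &= -\log \int_{\mathcal{W}} p(X_{n+1} \mid w)\,
       \frac{e^{-n L_n(w)}\, \varphi(w)}{Z_n(\mathcal{W})} \, dw \\
  &= -\log p(X_{n+1} \mid D_n, \mathcal{W}),
\end{aligned}
\qquad \text{so} \qquad
G_n(\mathcal{W}) = \mathbb{E}_{X_{n+1}}\!\left[ -\log p(X_{n+1} \mid D_n, \mathcal{W}) \right].
```

Under these assumptions, the local generalisation loss is the expected log loss of the predictive distribution obtained by restricting the posterior to $\mathcal{W}$.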
Purposefully naive question: can I just choose a region $\mathcal{W}$ that contains all singularities? Then it surely wins, but this doesn’t help us because this region can be very large.
Yes, and as you say, this would be very uninteresting (and in general you wouldn’t necessarily know what to pick [although we did in the phase transition in DSLT4 because of the classification of $W_0$ in DSLT3]). The point is that at no point are you just magically “choosing” a $\mathcal{W}$ anyway. If you really want to calculate the free energy of some model setup then you would have a reason to choose different phases to analyse. Otherwise, the premise of this section of the post is to show that the geometry of $K(w)$ depends on the singularity structure, and this varies across parameter space.
Possible correction: one of those points isn’t a singularity, but a regular loss-minimizing point (as you also clarify further below).
As discussed in the comment in your DSLT1 question, they are both singularities of $K(w)$ since they are both critical points (local minima). But they are not both true parameters, nor are they both regular points with RLCT $\tfrac{1}{2}$.
How do you think about this? What are sensible choices of priors (or network initializations) from the SLT perspective?
I think the question of sensible priors has an interesting and an uninteresting angle to it. The interesting answer might involve something along the lines of reformulating the Jeffreys prior, as well as noticing that a Gaussian prior gives you a “regularisation” term (and can be thought of as adding the “simple harmonic oscillator” part to the story). The uninteresting answer is that SLT doesn’t care about the prior (other than its regularity conditions) since it is irrelevant in the n→∞ limit. Also, if you were concerned with the requirement for $W$ to be compact, you can just define it to be compact on the space of “numbers that my computer can deal with”.
Also, I find it curious that in the second example, the posterior will converge to the lowest loss, but SGD would not since it wouldn’t “manage to get out of the right valley”, I assume. This seems to suggest that the Bayesian view of SGD can at most be true in high dimensions, but not for very low-dimensional neural networks. Would you agree with that, or what is your perspective?
Yes! We are thinking very much about this at the moment and I think this is the correct intuition to have. If one runs SGD on the potential wells $K(w) = (w+1)^2 (w-1)^4$, you find that it just gets stuck in the basin it was closest to. So, what’s going on in high dimensions? It seems something about the way higher dimensional spaces are different from lower ones is relevant here, but it’s very much an open problem.
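The “stuck in its basin” behaviour is easy to reproduce in a toy setting. The sketch below is not real minibatch SGD: it uses gradient descent on $K$ with added Gaussian gradient noise as a stand-in. The learning rate, noise scale, step count and starting points are arbitrary illustrative choices, and `noisy_gradient_descent` is a hypothetical helper name.

```python
import numpy as np


def grad_K(w):
    """Derivative of K(w) = (w+1)^2 (w-1)^4, factored as 2(w+1)(w-1)^3(3w+1).

    The critical points are w = -1 and w = 1 (minima) and w = -1/3
    (the local maximum separating the two basins).
    """
    return 2.0 * (w + 1.0) * (w - 1.0) ** 3 * (3.0 * w + 1.0)


def noisy_gradient_descent(w0, lr=0.01, noise_std=0.1, steps=5000, seed=0):
    """Gradient descent with Gaussian gradient noise (a crude SGD proxy)."""
    rng = np.random.default_rng(seed)
    w = float(w0)
    for _ in range(steps):
        noisy_grad = grad_K(w) + noise_std * rng.normal()
        w -= lr * noisy_grad
    return w


# One start on each side of the barrier at w = -1/3.
final_left = noisy_gradient_descent(-0.6)
final_right = noisy_gradient_descent(0.6)

print(f"start -0.6 -> {final_left:.3f}")
print(f"start +0.6 -> {final_right:.3f}")
```

With this small a noise scale, the iterate started at $-0.6$ settles near $w = -1$ and the one started at $0.6$ drifts towards $w = 1$. Neither crosses the barrier, even though the posterior asymptotically prefers the quartic minimum.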
Thanks for the answer! I think my first question was confused because I didn’t realize you were talking about local free energies instead of the global one :)
As discussed in the comment in your DSLT1 question, they are both singularities of K(w) since they are both critical points (local minima).
Oh, I actually may have missed that aspect of your answer back then. I’m confused by that: in algebraic geometry, the zeros of a set of polynomials are not necessarily already singularities. E.g., for $f(x,y) = xy$, the zero set consists of the two axes, which form an algebraic variety, but only at $(0,0)$ is there a singularity because the derivative vanishes there. Now, for the KL-divergence, the situation seems more extreme: the zeros are also, at the same time, the minima of $K$, and thus the derivative vanishes at every point in the set $W_0$. This suggests every point in $W_0$ is singular. Is this correct?
So far, I thought “being singular” means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it’s about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.
The uninteresting answer is that SLT doesn’t care about the prior (other than its regularity conditions) since it is irrelevant in the n→∞ limit.
I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as the Jeffreys prior somewhat changes asymptotic behavior for n→∞, but I’m not certain of that.
Now, for the KL-divergence, the situation seems more extreme: the zeros are also, at the same time, the minima of $K$, and thus the derivative vanishes at every point in the set $W_0$. This suggests every point in $W_0$ is singular. Is this correct?
Correct! So, the point is that things get interesting when $W_0$ is more than just a single point (which is the regular case). In essence, singularities are local minima of $K(w)$. In the non-realisable case this means they are zeroes of the minimum-loss level set. In fact we can abuse notation a bit and really just refer to any local minimum of $K(w)$ as a singularity. The TLDR of this is:
$$\text{singularities of } K(w) = \text{critical points of } K(w)$$
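A small worked example may help separate the two notions of “singular” in play. Take the polynomial from the question, $f(x,y) = xy$, and, purely as an illustrative assumption, a non-negative function with the same zero set, $K(x,y) = f(x,y)^2$. This $K$ is not claimed to be the KL divergence of any particular model.

```latex
\nabla f(x,y) = (y,\; x)
  \;\Longrightarrow\; \nabla f = 0 \text{ only at } (0,0),
\qquad
\nabla K(x,y) = (2xy^2,\; 2x^2 y)
  \;\Longrightarrow\; \nabla K = 0 \text{ whenever } xy = 0.
```

So every point on the two axes is a critical point of $K$, even though only the origin is a singular point of the variety $\{f = 0\}$.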
So far, I thought “being singular” means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it’s about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.
As I show in the examples in DSLT1, having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (e.g. $K(w) = w^4$). Precisely, suppose $d$ is the number of parameters, then you are in the regular case if $K(w)$ can be expressed as a full-rank quadratic form near each singularity,
$$K(w) = \sum_{i=1}^{d} w_i^2.$$
Anything less than this is a strictly singular case.
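As a rough numerical illustration of the regular case and the two degenerate flavours, the sketch below estimates the Hessian of three toy potentials at the origin by central finite differences and reports its rank. The three potentials, the step size `h`, the rank tolerance, and the helper names are all illustrative assumptions rather than anything taken from DSLT1.

```python
import numpy as np


def numerical_hessian(f, w, h=1e-3):
    """Central finite-difference estimate of the Hessian of f at w."""
    w = np.asarray(w, dtype=float)
    d = w.size
    hess = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i = np.zeros(d)
            e_j = np.zeros(d)
            e_i[i] = h
            e_j[j] = h
            hess[i, j] = (
                f(w + e_i + e_j) - f(w + e_i - e_j)
                - f(w - e_i + e_j) + f(w - e_i - e_j)
            ) / (4.0 * h * h)
    return hess


def hessian_rank(f, w, tol=1e-4):
    """Rank of the estimated Hessian; singular values below tol count as zero."""
    return int(np.linalg.matrix_rank(numerical_hessian(f, w), tol=tol))


def regular(w):          # full-rank quadratic form, d = 2
    return w[0] ** 2 + w[1] ** 2


def rank_deficient(w):   # second parameter is a free direction, d = 2
    return w[0] ** 2


def quartic(w):          # vanishing second derivative, d = 1
    return w[0] ** 4


ranks = {
    "regular": hessian_rank(regular, [0.0, 0.0]),
    "rank_deficient": hessian_rank(rank_deficient, [0.0, 0.0]),
    "quartic": hessian_rank(quartic, [0.0]),
}
print(ranks)
```

Only the first potential has a Hessian of full rank $d$ at its minimum. The other two fall short of full rank in the two different ways described above.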
I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as the Jeffreys prior somewhat changes asymptotic behavior for n→∞, but I’m not certain of that.
Watanabe has an interesting little section in the grey book [Remark 7.4, Theorem 7.4, Wat09] talking about the Jeffreys prior. I haven’t studied it in detail but to the best of my reading he is basically saying “from the point of view of SLT, the Jeffreys prior is zero at singularities anyway, its coordinate-free nature makes it inappropriate for statistical learning, and the RLCT can only be $\lambda \ge \tfrac{d}{2}$ if the Jeffreys prior is employed.” (The last statement is the content of the theorem, where he studies the poles of the zeta function when the Jeffreys prior is employed.)
As I show in the examples in DSLT1, having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (e.g. $K(w) = w^4$). Precisely, suppose $d$ is the number of parameters, then you are in the regular case if $K(w)$ can be expressed as a full-rank quadratic form near each singularity,
$$K(w) = \sum_{i=1}^{d} w_i^2.$$
Anything less than this is a strictly singular case.
So if $K(w) = w^2$, then $w = 0$ is a singularity but not a strict singularity, do you agree? It still feels like somewhat bad terminology to me, but maybe it’s justified from the algebraic-geometry perspective.
Thanks for the reply!