Liam Carroll comments on DSLT 2. Why Neural Networks obey Occam’s Razor

Liam Carroll 27 Jun 2023 6:30 UTC
6 points
Can you tell more about why it is a measure of posterior concentration.
...
Are you claiming that most of that work happens very localized in a small parameter region?
Given a small neighbourhood $\mathcal{W} \subset W$, the free energy is $F_{n} (\mathcal{W}) = - \log Z_{n} (\mathcal{W})$ and $Z_{n} (\mathcal{W})$ measures the posterior concentration in $\mathcal{W}$ since
$Z_{n} (\mathcal{W}) = \int_{\mathcal{W}} e^{- n L_{n} (w)} \varphi (w) \, dw$
where the integrand is the posterior, modulo its normalisation constant $Z_{n}$. The key here is that if we are comparing different regions $\mathcal{W}$ of parameter space, then the free energy doesn’t care about that normalisation constant, as it just shifts $F_{n} (\mathcal{W})$ by a constant. So the free energy gives you a tool for comparing different regions of the posterior. (To make this comparison rigorous, I suppose one would want to make sure that these regions $\mathcal{W}_{1}, \mathcal{W}_{2}$ are the same “size”. Another perspective, and really the main SLT perspective, is that if they are sufficiently small and localised around different singularities then this size problem isn’t really relevant, and the free energy is telling you something about the structure of the singularity and the local geometry of $K (w)$ around the singularity).
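To make the comparison concrete, here is a minimal numerical sketch. It is not from the post: it assumes a one-dimensional toy setup in which $L_{n} (w)$ is replaced by the potential $K (w) = (w + 1)^{2} (w - 1)^{4}$, the prior is uniform on an interval of width 6, and the two regions are intervals of radius 0.5 around $w = -1$ and $w = 1$. The names `local_free_energy`, `radius` and `prior_width` are hypothetical.

```python
import numpy as np

def K(w):
    # Toy potential with a quadratic minimum at w = -1 and a quartic minimum at w = 1.
    return (w + 1.0) ** 2 * (w - 1.0) ** 4

def local_free_energy(n, centre, radius=0.5, prior_width=6.0, num=200_001):
    # F_n(region) = -log Z_n(region), with Z_n the integral of exp(-n K(w)) * prior
    # over the interval [centre - radius, centre + radius].
    w = np.linspace(centre - radius, centre + radius, num)
    prior = 1.0 / prior_width  # assumed uniform prior density
    integrand = np.exp(-n * K(w)) * prior
    # Trapezoidal rule written out by hand.
    Z = np.sum((integrand[1:] + integrand[:-1]) * np.diff(w)) / 2.0
    return -np.log(Z)

ns = np.array([10**3, 10**4, 10**5, 10**6])
F_left = np.array([local_free_energy(n, -1.0) for n in ns])
F_right = np.array([local_free_energy(n, 1.0) for n in ns])

# Slope of F_n against log n in each region.
slope_left = np.polyfit(np.log(ns), F_left, 1)[0]
slope_right = np.polyfit(np.log(ns), F_right, 1)[0]
print(slope_left, slope_right)
```

Under these assumptions the fitted slopes of the free energy against $\log n$ should come out near $\frac{1}{2}$ around $w = -1$ and near $\frac{1}{4}$ around $w = 1$, so the region around the more degenerate point has the lower free energy at large $n$.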
I am not quite sure what you mean with “it tells us something about the information geometry of the posterior”
This is sloppily written by me, apologies. I merely mean to say “the free energy tells us what models the posterior likes”.
$G_{n} (\mathcal{W}) = E_{X_{n + 1}} [F_{n + 1} (\mathcal{W})] - F_{n} (\mathcal{W}) .$
I didn’t find a definition of the left expression.
I mean, the relation between $G_{n}$ and $F_{n}$ tells you that this is a sensible thing to write down, and if you reconstructed the left side from the right side you would simply find some definition in terms of the predictive distribution restricted to $\mathcal{W}$ (instead of $W$ in the integral).
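A sketch of that reconstruction, writing $\mathcal{W}$ for the region and assuming $L_{n}$ is the average negative log-likelihood, $L_{n} (w) = - \frac{1}{n} \sum_{i = 1}^{n} \log p (X_{i} \mid w)$ (written here without inputs for brevity), so that $e^{- n L_{n} (w)} = \prod_{i = 1}^{n} p (X_{i} \mid w)$. The notation $p (X_{n + 1} \mid D_{n}, \mathcal{W})$ is introduced here for illustration only:
$\frac{Z_{n + 1} (\mathcal{W})}{Z_{n} (\mathcal{W})} = \frac{\int_{\mathcal{W}} p (X_{n + 1} \mid w) \, e^{- n L_{n} (w)} \varphi (w) \, dw}{\int_{\mathcal{W}} e^{- n L_{n} (w)} \varphi (w) \, dw} =: p (X_{n + 1} \mid D_{n}, \mathcal{W})$
This is the predictive distribution obtained from the posterior restricted, and renormalised, to $\mathcal{W}$. Since $F_{n + 1} (\mathcal{W}) - F_{n} (\mathcal{W}) = - \log \frac{Z_{n + 1} (\mathcal{W})}{Z_{n} (\mathcal{W})}$, taking the expectation over $X_{n + 1}$ gives
$G_{n} (\mathcal{W}) = E_{X_{n + 1}} [- \log p (X_{n + 1} \mid D_{n}, \mathcal{W})] .$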
Purposefully naive question: can I just choose a region $\mathcal{W}$ that contains all singularities? Then it surely wins, but this doesn’t help us because this region can be very large.
Yes, and as you say, this would be very uninteresting (and in general you wouldn’t necessarily know what to pick [although we did in the phase transition post DSLT4 because of the classification of $W_{0}$ in DSLT3]). The point is that at no point are you just magically “choosing” a $\mathcal{W}$ anyway. If you really want to calculate the free energy of some model setup then you would have a reason to choose different phases to analyse. Otherwise, the premise of this section of the post is to show that the geometry of $K (w)$ depends on the singularity structure and this varies across parameter space.
Possible correction: one of those points isn’t a singularity, but a regular loss-minimizing point (as you also clarify further below).
As discussed in the comment in your DSLT1 question, they are both singularities of $K (w)$ since they are both critical points (local minima). But they are not both true parameters, nor are they both regular points with RLCT $\frac{1}{2}$ .
How do you think about this? What are sensible choices of priors (or network initializations) from the SLT perspective?
I think the question of sensible priors has an interesting and an uninteresting angle. The interesting answer might involve something along the lines of reformulating the Jeffreys prior, as well as noticing that a Gaussian prior gives you a “regularisation” term (and can be thought of as adding the “simple harmonic oscillator” part to the story). The uninteresting answer is that SLT doesn’t care about the prior (other than its regularity conditions) since it is irrelevant in the $n \to \infty$ limit. Also if you were concerned with the requirement for $W$ to be compact, you can just define it to be compact on the space of “numbers that my computer can deal with”.
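To spell out the Gaussian case (a sketch, with the prior variance $\sigma^{2}$ an assumed symbol not used in the post): if $\varphi (w) \propto e^{- \| w \|^{2} / (2 \sigma^{2})}$ then the unnormalised posterior is
$e^{- n L_{n} (w)} \varphi (w) \propto \exp \left( - n \left[ L_{n} (w) + \frac{1}{2 \sigma^{2} n} \| w \|^{2} \right] \right)$
so the prior acts as an $L^{2}$ penalty on the loss whose relative weight $\frac{1}{2 \sigma^{2} n}$ shrinks as $n$ grows, which is one way to see both the “regularisation” reading and why the prior drops out in the $n \to \infty$ limit.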
Also, I find it curious that in the second example, the posterior will converge to the lowest loss, but SGD would not since it wouldn’t “manage to get out of the right valley”, I assume. This seems to suggest that the Bayesian view of SGD can at most be true in high dimensions, but not for very low-dimensional neural networks. Would you agree with that, or what is your perspective?
Yes! We are thinking very much about this at the moment and I think this is the correct intuition to have. If one runs SGD on the potential wells $K (w) = (w + 1)^{2} (w - 1)^{4}$ , you find that it just gets stuck in the basin it was closest to. So, what’s going on in high dimensions? It seems something about the way higher dimensional spaces are different from lower ones is relevant here, but it’s very much an open problem.
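A minimal sketch of this kind of experiment, assuming “SGD” is modelled as gradient descent on $K$ with Gaussian noise added to the gradient. The noise model, learning rate, step count and starting points are all assumptions made here for illustration, and `noisy_gradient_descent` is a hypothetical helper.

```python
import numpy as np

def grad_K(w):
    # Derivative of K(w) = (w + 1)^2 (w - 1)^4.
    return 2.0 * (w + 1.0) * (w - 1.0) ** 4 + 4.0 * (w + 1.0) ** 2 * (w - 1.0) ** 3

def noisy_gradient_descent(w0, lr=0.01, noise_std=0.1, steps=5000, seed=0):
    # Gradient descent with additive Gaussian gradient noise (an assumed stand-in for SGD).
    rng = np.random.default_rng(seed)
    w = float(w0)
    for _ in range(steps):
        noisy_grad = grad_K(w) + noise_std * rng.normal()
        w -= lr * noisy_grad
    return w

# Two starting points on opposite sides of the barrier between the wells.
final_left = noisy_gradient_descent(-0.6)
final_right = noisy_gradient_descent(0.0)
print(final_left, final_right)
```

Since $K' (w) = 2 (w + 1) (w - 1)^{3} (3 w + 1)$, the barrier between the two wells sits at $w = - \frac{1}{3}$. With these settings the run started at $-0.6$ should end near $w = -1$ and the run started at $0$ should end near $w = 1$, each staying in the basin it started in.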
- Leon Lang 3 Jul 2023 23:13 UTC
  2 points
  Thanks for the answer! I think my first question was confused because I didn’t realize you were talking about local free energies instead of the global one :)
  As discussed in the comment in your DSLT1 question, they are both singularities of $K (w)$ since they are both critical points (local minima).
  Oh, I actually may have missed that aspect of your answer back then. I’m confused by that: in algebraic geometry, the zeros of a set of polynomials are not necessarily already singularities. E.g., in $f (x, y) = x y$ , the zero set consists of the two axes, which form an algebraic variety, but only at $(0, 0)$ is there a singularity because the derivative vanishes.
  Now, for the KL-divergence, the situation seems more extreme: The zeros are also, at the same time, the minima of $K$ , and thus, the derivative vanishes at every point in the set $W_{0}$ . This suggests every point in $W_{0}$ is singular. Is this correct?
  So far, I thought “being singular” means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it’s about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.
  The uninteresting answer is that SLT doesn’t care about the prior (other than its regularity conditions) since it is irrelevant in the $n \to \infty$ limit.
  I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as the Jeffreys prior somewhat changes asymptotic behavior for $n \to \infty$ , but I’m not certain of that.
  - Liam Carroll 11 Jul 2023 23:40 UTC
    7 points
    Now, for the KL-divergence, the situation seems more extreme: The zeros are also, at the same time, the minima of $K$ , and thus, the derivative vanishes at every point in the set $W_{0}$ . This suggests every point in $W_{0}$ is singular. Is this correct?
    Correct! So, the point is that things get interesting when $W_{0}$ is more than just a single point (which is the regular case). In essence, singularities are local minima of $K (w)$ . In the non-realisable case this means they are zeroes of the minimum-loss level set. In fact we can abuse notation a bit and really just refer to any local minimum of $K (w)$ as a singularity. The TLDR of this is:
    $\text{singularities of } K (w) = \text{critical points of } K (w)$
    So far, I thought “being singular” means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it’s about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.
    As I show in the examples in DSLT1, having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (i.e. $K (w) = w^{4}$ ). Precisely, suppose $d$ is the number of parameters, then you are in the regular case if $K (w)$ can be expressed as a full-rank quadratic form near each singularity,
    $K (w) = \sum_{i = 1}^{d} w_{i}^{2} .$
    Anything less than this is a strictly singular case.
    I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as the Jeffreys prior somewhat changes asymptotic behavior for $n \to \infty$ , but I’m not certain of that.
    Watanabe has an interesting little section in the grey book [Remark 7.4, Theorem 7.4, Wat09] talking about the Jeffreys prior. I haven’t studied it in detail but to the best of my reading he is basically saying “from the point of view of SLT, the Jeffreys prior is zero at singularities anyway, its coordinate-free nature makes it inappropriate for statistical learning, and the RLCT can only be $\lambda \geq \frac{d}{2}$ if the Jeffreys prior is employed.” (The last statement is the content of the theorem where he studies the poles of the zeta function when the Jeffreys prior is employed).
    - Leon Lang 12 Jul 2023 6:26 UTC
      2 points
      Thanks for the reply!
      As I show in the examples in DSLT1, having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (i.e. $K (w) = w^{4}$ ). Precisely, suppose $d$ is the number of parameters, then you are in the regular case if $K (w)$ can be expressed as a full-rank quadratic form near each singularity,
      $K (w) = \sum_{i = 1}^{d} w_{i}^{2} .$
      Anything less than this is a strictly singular case.
      So if $K (w) = w^{2}$ , then $w = 0$ is a singularity but not a strict singularity, do you agree? It still feels like somewhat bad terminology to me, but maybe it’s justified from the algebraic-geometry perspective.