Thanks for this nice post! I find it slightly more vague than the first post, but I guess that is hard to avoid when trying to distill highly technical topics. I got a lot out of it.
Fundamentally, we care about the free energy $F_n = -\log Z_n$ because it is a measure of posterior concentration, and as we showed with the BIC calculation in DSLT1, it tells us something about the information geometry of the posterior.
Can you say more about why it is a measure of posterior concentration? (It gets a bit clearer further below, but I state my question nonetheless to express that this statement isn’t locally clear to me here.) I may lack some background in Bayesian statistics here. In the first post, you wrote the posterior as
$$p(w \mid D_n) := \frac{1}{Z_n} \varphi(w) e^{-n L_n(w)},$$
and it seems like you want to say that if the free energy is low, then the posterior is more concentrated. If I look at this formula, then low free energy corresponds to high $Z_n$, meaning the prior and likelihood have to “work quite a bit” to ensure that this expression overall integrates to 1. Are you claiming that most of that work happens in a very localized way, in a small parameter region?
Additionally, I am not quite sure what you mean by “it tells us something about the information geometry of the posterior”, or even what you mean by “information geometry” here. I guess one answer is that you showed in post 1 that the Fisher information matrix appears in the formula for the free energy, which contains geometric information about the loss landscape. But then in the proof, you regarded that as a constant that you ignored in the final BIC formula, so I’m not sure if that’s what you are referring to here. More explicit references would be useful to me.

Note to other readers (as this wasn’t clear to me immediately): the correspondence between free energy and posterior concentration holds because one can show that
$$\int_W p(w \mid D_n)\, dw = \frac{1}{Z_n} e^{-F_n(W)}.$$
Here, $Z_n$ is the global partition function.
I believe the first expression should be an expectation over XY.
It follows immediately that the generalisation loss of a region $W \subseteq \mathcal{W}$ is
$$G_n(W) = \mathbb{E}_{X_{n+1}}\left[F_{n+1}(W)\right] - F_n(W).$$
I didn’t find a definition of the left expression.
So, the region in $\mathcal{W}$ that minimises the free energy has the best accuracy-complexity tradeoff. This is the sense in which singular models obey Occam’s Razor: if two regions are equally accurate, then the simpler one is preferred.
Purposefully naive question: can I just choose a region W that contains all singularities? Then it surely wins, but this doesn’t help us because this region can be very large.
So I guess you also want to choose small regions. You hinted at that already by saying that $W$ should be compact. But now I of course wonder whether sometimes all of $W_0$ already lies within a single compact set.
There are two singularities in the set of true parameters,
$$W_0 = \{-1, 1\},$$
which we will label as $w^{(0)}_{-1}$ and $w^{(0)}_{1}$ respectively.
Possible correction: one of those points isn’t a singularity, but a regular loss-minimizing point (as you also clarify further below).
Let’s consider a one-parameter model ($d = 1$) with KL divergence defined by
$$K(w) = (w+1)^2 (w-1)^4,$$
on the region $W = [-2, 2]$ with uniform prior $\varphi(w) = \tfrac{1}{4} \mathbf{1}(w \in W)$.
The prior seems to do some work here: if it doesn’t properly support the region with low RLCT, then the posterior cannot converge there. I guess a similar story might a priori hold for SGD, where how you initialize your neural network might matter for convergence.
How do you think about this? What are sensible choices of priors (or network initializations) from the SLT perspective?
Also, I find it curious that in the second example, the posterior will converge to the lowest loss, but SGD would not since it wouldn’t “manage to get out of the right valley”, I assume. This seems to suggest that the Bayesian view of SGD can at most be true in high dimensions, but not for very low-dimensional neural networks. Would you agree with that, or what is your perspective?
Can you say more about why it is a measure of posterior concentration?
...
Are you claiming that most of that work happens in a very localized way, in a small parameter region?
Given a small neighbourhood $W \subset \mathcal{W}$, the free energy is $F_n(W) = -\log Z_n(W)$, and $Z_n(W)$ measures the posterior concentration in $W$ since
$$Z_n(W) = \int_W e^{-n L_n(w)} \varphi(w)\, dw,$$
where the integrand is the posterior, modulo its normalisation constant $Z_n$. The key here is that if we are comparing different regions of parameter space $\mathcal{W}$, then the free energy doesn’t care about that normalisation constant, as it just shifts $F_n(W)$ by a constant. So the free energy gives you a tool for comparing different regions of the posterior. (To make this comparison rigorous, I suppose one would want to make sure that the regions $W_1, W_2$ are the same “size”. Another perspective, and really the main SLT perspective, is that if they are sufficiently small and localised around different singularities, then this size problem isn’t really relevant, and the free energy is telling you something about the structure of the singularity and the local geometry of $K(w)$ around it.)
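To make this comparison concrete, here is a numerical sketch for the running example $K(w) = (w+1)^2(w-1)^4$ (my own illustration, not from the post: I split $W = [-2, 2]$ at $w = 0$, and I use $K(w)$ in place of the empirical loss $L_n(w)$, which only shifts every $F_n(W)$ by the same constant):

```python
import numpy as np
from scipy import integrate

def K(w):
    # running example from the post: K(w) = (w + 1)^2 (w - 1)^4
    return (w + 1) ** 2 * (w - 1) ** 4

def local_free_energy(a, b, n):
    # F_n(W) = -log Z_n(W), with Z_n(W) = ∫_W e^{-nK(w)} φ(w) dw
    # and the uniform prior φ(w) = 1/4 on W = [-2, 2]
    Z, _ = integrate.quad(lambda w: 0.25 * np.exp(-n * K(w)), a, b)
    return -np.log(Z)

for n in (10, 100, 1000):
    F_left = local_free_energy(-2.0, 0.0, n)   # region around w = -1, RLCT 1/2
    F_right = local_free_energy(0.0, 2.0, n)   # region around w = +1, RLCT 1/4
    print(f"n={n}: F(left)={F_left:.2f}, F(right)={F_right:.2f}")
```

For these values the right-hand region, containing the flatter singularity with the smaller RLCT, comes out with the lower free energy, which is what the asymptotic expansion of $F_n(W)$ predicts.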
I am not quite sure what you mean by “it tells us something about the information geometry of the posterior”
This is sloppily written by me, apologies. I merely mean to say “the free energy tells us what models the posterior likes”.
$$G_n(W) = \mathbb{E}_{X_{n+1}}\left[F_{n+1}(W)\right] - F_n(W).$$
I didn’t find a definition of the left expression.
I mean, the relation between $G_n$ and $F_n$ tells you that this is a sensible thing to write down, and if you reconstructed the left side from the right side, you would simply find some definition in terms of the predictive distribution restricted to $W$ (instead of $\mathcal{W}$ in the integral).
Purposefully naive question: can I just choose a region W that contains all singularities? Then it surely wins, but this doesn’t help us because this region can be very large.
Yes, and as you say, this would be very uninteresting (and in general you wouldn’t necessarily know what to pick [although we did in the phase transition post DSLT4, because of the classification of $W_0$ in DSLT3]). The point is that at no stage are you just magically “choosing” a $W$ anyway. If you really wanted to calculate the free energy of some model setup, you would have a reason to choose different phases to analyse. Otherwise, the premise of this section of the post is to show that the geometry of $K(w)$ depends on the singularity structure, and this varies across parameter space.
Possible correction: one of those points isn’t a singularity, but a regular loss-minimizing point (as you also clarify further below).
As discussed in the comment in your DSLT1 question, they are both singularities of $K(w)$ since they are both critical points (local minima). But they are not both true parameters, nor are they both regular points with RLCT $\tfrac{1}{2}$.
How do you think about this? What are sensible choices of priors (or network initializations) from the SLT perspective?
I think the question of sensible priors has an interesting and an uninteresting angle to it. The interesting answer might involve something along the lines of reformulating the Jeffreys prior, as well as noticing that a Gaussian prior gives you a “regularisation” term (and can be thought of as adding the “simple harmonic oscillator” part to the story). The uninteresting answer is that SLT doesn’t care about the prior (other than its regularity conditions), since it is irrelevant in the $n \to \infty$ limit. Also, if you are concerned about the requirement for $W$ to be compact, you can just take it to be the compact space of “numbers that my computer can deal with”.
Also, I find it curious that in the second example, the posterior will converge to the lowest loss, but SGD would not since it wouldn’t “manage to get out of the right valley”, I assume. This seems to suggest that the Bayesian view of SGD can at most be true in high dimensions, but not for very low-dimensional neural networks. Would you agree with that, or what is your perspective?
Yes! We are thinking very much about this at the moment, and I think this is the correct intuition to have. If one runs SGD on the potential well $K(w) = (w+1)^2 (w-1)^4$, one finds that it just gets stuck in the basin it started closest to. So, what’s going on in high dimensions? It seems that something about the way higher-dimensional spaces differ from lower-dimensional ones is relevant here, but it’s very much an open problem.
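That experiment is easy to reproduce in a few lines (a toy sketch of my own: plain gradient descent stands in for SGD, and the learning rate and step count are arbitrary choices):

```python
def grad_K(w):
    # derivative of K(w) = (w + 1)^2 (w - 1)^4 by the product rule
    return 2 * (w + 1) * (w - 1) ** 4 + 4 * (w + 1) ** 2 * (w - 1) ** 3

def descend(w0, lr=1e-3, steps=200_000):
    # gradient descent never crosses the barrier at w = -1/3
    # (the local maximum of K between the two basins), so it
    # converges to whichever basin w0 starts in
    w = w0
    for _ in range(steps):
        w -= lr * grad_K(w)
    return w

print(descend(-1.5))  # starts left of the barrier: ends near w = -1
print(descend(0.5))   # starts right of the barrier: ends near w = +1
```

Note how slowly the quartic basin is approached: near $w = 1$ the gradient is cubic in the distance to the minimum, so after 200k steps the iterate is still visibly away from $1$, while the quadratic basin at $w = -1$ is reached essentially exactly.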
Thanks for the answer! I think my first question was confused because I didn’t realize you were talking about local free energies instead of the global one :)
As discussed in the comment in your DSLT1 question, they are both singularities of K(w) since they are both critical points (local minima).
Oh, I actually may have missed that aspect of your answer back then. I’m confused by it: in algebraic geometry, the zeros of a set of polynomials are not necessarily singularities. E.g., for $f(x, y) = xy$, the zero set consists of the two axes, which form an algebraic variety, but only at $(0, 0)$ is there a singularity, because that is where the derivative vanishes. Now, for the KL divergence, the situation seems more extreme: the zeros are also, at the same time, the minima of $K$, and thus the derivative vanishes at every point of the set $W_0$. This suggests every point in $W_0$ is singular. Is this correct?
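A trivial check of the $f(x, y) = xy$ example (my own illustration): the gradient $\nabla f = (y, x)$ is nonzero at points of the variety away from the origin, and vanishes only at the crossing point.

```python
def grad_f(x, y):
    # gradient of f(x, y) = x * y
    return (y, x)

# (0, 3) lies on the zero set (the y-axis) but is a smooth point:
print(grad_f(0.0, 3.0))  # (3.0, 0.0), a nonzero gradient
# the origin is the only singular point of the variety {xy = 0}:
print(grad_f(0.0, 0.0))  # (0.0, 0.0)
```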
So far, I thought “being singular” means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it’s about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.
The uninteresting answer is that SLT doesn’t care about the prior (other than its regularity conditions) since it is irrelevant in the n→∞ limit.
I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as the Jeffreys prior somewhat changes the asymptotic behavior as $n \to \infty$, but I’m not certain of that.
Now, for the KL divergence, the situation seems more extreme: the zeros are also, at the same time, the minima of $K$, and thus the derivative vanishes at every point of the set $W_0$. This suggests every point in $W_0$ is singular. Is this correct?
Correct! So, the point is that things get interesting when $W_0$ is more than just a single point (a single point being the regular case). In essence, singularities are local minima of $K(w)$. In the non-realisable case, this means they are points of the minimum-loss level set. In fact, we can abuse notation a bit and just refer to any local minimum of $K(w)$ as a singularity. The TLDR of this is:
$$\text{singularities of } K(w) = \text{critical points of } K(w).$$
So far, I thought “being singular” means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it’s about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.
As I show in the examples in DSLT1, having degenerate Fisher information (i.e. a degenerate Hessian at zeroes) comes in two essential flavours: rank-deficiency, and a vanishing second derivative (e.g. $K(w) = w^4$). Precisely, suppose $d$ is the number of parameters; then you are in the regular case if $K(w)$ can be expressed as a full-rank quadratic form near each singularity,
$$K(w) = \sum_{i=1}^{d} w_i^2.$$
Anything less than this is a strictly singular case.
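The two flavours can be checked numerically with finite differences (a quick sketch of my own, not from the thread), comparing the regular $K(w) = w^2$ with the strictly singular $K(w) = w^4$ at their common critical point $w = 0$:

```python
def second_derivative(f, w, h=1e-4):
    # central finite-difference approximation of f''(w)
    return (f(w + h) - 2 * f(w) + f(w - h)) / h ** 2

# regular case: K(w) = w^2 has a nondegenerate Hessian at w = 0
print(second_derivative(lambda w: w ** 2, 0.0))  # ≈ 2.0
# strictly singular case: K(w) = w^4 has vanishing second derivative there
print(second_derivative(lambda w: w ** 4, 0.0))  # ≈ 0.0
```

In more than one dimension the same check applied to the Hessian matrix detects rank-deficiency, the other flavour of degeneracy.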
I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as Jeffrey’s prior somewhat changes asymptotic behavior for n→∞, but I’m not certain of that.
Watanabe has an interesting little section in the grey book [Remark 7.4, Theorem 7.4, Wat09] discussing the Jeffreys prior. I haven’t studied it in detail, but to the best of my reading he is basically saying: “from the point of view of SLT, the Jeffreys prior is zero at singularities anyway, its coordinate-free nature makes it inappropriate for statistical learning, and the RLCT can only be $\lambda \geq \tfrac{d}{2}$ if the Jeffreys prior is employed.” (The last statement is the content of the theorem, where he studies the poles of the zeta function when the Jeffreys prior is employed.)
As I show in the examples in DSLT1, having degenerate Fisher information (i.e. a degenerate Hessian at zeroes) comes in two essential flavours: rank-deficiency, and a vanishing second derivative (e.g. $K(w) = w^4$). Precisely, suppose $d$ is the number of parameters; then you are in the regular case if $K(w)$ can be expressed as a full-rank quadratic form near each singularity,
$$K(w) = \sum_{i=1}^{d} w_i^2.$$
Anything less than this is a strictly singular case.
Thanks for the reply! So if $K(w) = w^2$, then $w = 0$ is a singularity but not a strict singularity, do you agree? It still feels like somewhat bad terminology to me, but maybe it’s justified from the algebraic-geometry perspective.