Liam Carroll comments on DSLT 2. Why Neural Networks obey Occam’s Razor

Liam Carroll 11 Jul 2023 23:40 UTC
7 points
Now, for the KL-divergence, the situation seems more extreme: the zeros are also, at the same time, the minima of $K$, and thus the derivative vanishes at every point in the set $W_{0}$. This suggests every point in $W_{0}$ is singular. Is this correct?
Correct! So, the point is that things get interesting when $W_{0}$ is more than just a single point (a single point being the regular case). In essence, singularities are local minima of $K(w)$. In the non-realisable case this means they are the points of the minimum-loss level set. In fact we can abuse notation a bit and really just refer to any local minimum of $K(w)$ as a singularity. The TLDR of this is:
$\text{singularities of } K(w) = \text{critical points of } K(w)$
So far, I thought “being singular” means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it’s about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.
As I show in the examples in DSLT1, having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (e.g. $K(w) = w^{4}$). Precisely, suppose $d$ is the number of parameters; then you are in the regular case if $K(w)$ can be expressed as a full-rank quadratic form near each singularity,
$K(w) = \sum_{i=1}^{d} w_{i}^{2}.$
Anything less than this is a strictly singular case.
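To make the two flavours concrete, here is a minimal sketch that checks the Hessian rank of $K$ at a minimum numerically. It assumes $K$ is smooth and relies on the Morse lemma (a full-rank Hessian at a critical point means $K$ is a full-rank quadratic form in suitable local coordinates). The helper names `hessian`, `rank` and `is_regular_at`, the step size, the tolerance, and the two-parameter toy functions are all illustrative choices of mine rather than anything from the post.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def hessian(K: Callable[[Vector], float], w: Vector, h: float = 1e-3) -> List[List[float]]:
    """Approximate the Hessian of K at w with central finite differences."""
    d = len(w)

    def shifted(i: int, si: int, j: int, sj: int) -> float:
        # Evaluate K after moving coordinate i by si*h and coordinate j by sj*h.
        v = list(w)
        v[i] += si * h
        v[j] += sj * h
        return K(v)

    H = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            H[i][j] = (
                shifted(i, 1, j, 1) - shifted(i, 1, j, -1)
                - shifted(i, -1, j, 1) + shifted(i, -1, j, -1)
            ) / (4 * h * h)
    return H


def rank(M: List[List[float]], tol: float = 1e-3) -> int:
    """Numerical rank via Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    rows, cols = len(A), len(A[0])
    r = 0
    for c in range(cols):
        if r == rows:
            break
        pivot = max(range(r, rows), key=lambda i: abs(A[i][c]))
        if abs(A[pivot][c]) <= tol:
            continue  # no usable pivot in this column
        A[r], A[pivot] = A[pivot], A[r]
        for i in range(r + 1, rows):
            factor = A[i][c] / A[r][c]
            for k in range(c, cols):
                A[i][k] -= factor * A[r][k]
        r += 1
    return r


def is_regular_at(K: Callable[[Vector], float], w: Vector) -> bool:
    """True if the Hessian of K at w has full rank d = len(w)."""
    return rank(hessian(K, w)) == len(w)


# Illustrative toy cases (assumptions for this sketch), each with a minimum at the origin.
examples = {
    "full-rank quadratic form: w1^2 + w2^2": (lambda w: w[0] ** 2 + w[1] ** 2, [0.0, 0.0]),
    "rank-deficient: w1^2 with d = 2": (lambda w: w[0] ** 2, [0.0, 0.0]),
    "vanishing second derivative: w^4": (lambda w: w[0] ** 4, [0.0]),
}

for name, (K, w0) in examples.items():
    H = hessian(K, w0)
    print(f"{name}: rank {rank(H)} of {len(w0)}, regular = {is_regular_at(K, w0)}")
```

The tolerance matters here: for $K(w) = w^{4}$ the finite-difference estimate of the second derivative at the origin is small but not exactly zero, so the rank test only treats entries below `tol` as vanishing. A symbolic check would avoid that, at the cost of an extra dependency.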
I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as the Jeffreys prior somewhat changes asymptotic behavior for $n \to \infty$, but I’m not certain of that.
Watanabe has an interesting little section in the grey book [Remark 7.4, Theorem 7.4, Wat09] talking about the Jeffreys prior. I haven’t studied it in detail, but to the best of my reading he is basically saying “from the point of view of SLT, the Jeffreys prior is zero at singularities anyway, its coordinate-free nature makes it inappropriate for statistical learning, and the RLCT can only be $\lambda \geq \frac{d}{2}$ if the Jeffreys prior is employed.” (The last statement is the content of the theorem, where he studies the poles of the zeta function when the Jeffreys prior is employed.)
- Leon Lang 12 Jul 2023 6:26 UTC
  2 points
  Thanks for the reply!
  As I show in the examples in DSLT1, having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (e.g. $K(w) = w^{4}$). Precisely, suppose $d$ is the number of parameters; then you are in the regular case if $K(w)$ can be expressed as a full-rank quadratic form near each singularity,
  $K(w) = \sum_{i=1}^{d} w_{i}^{2}.$
  Anything less than this is a strictly singular case.
  So if $K(w) = w^{2}$, then $w = 0$ is a singularity but not a strict singularity, do you agree? It still feels like somewhat bad terminology to me, but maybe it’s justified from the algebraic-geometry perspective.