Frank Seidl comments on Neural networks generalize because of this one weird trick

Frank Seidl 26 Jan 2023 4:33 UTC
7 points
The more important aim of this conversion is that now the minima of the term in the exponent, $K(w)$, are equal to 0. If we manage to find a way to express $K(w)$ as a polynomial, this lets us pull in the powerful machinery of algebraic geometry, which studies the zeros of polynomials. We’ve turned our problem of probability theory and statistics into a problem of algebra and geometry.
Wait… but $K(w)$ just isn’t a polynomial most of the time. Right? From its definition above, $K(w)$ differs by a constant from the log-likelihood $L(w)$. So the log-likelihood has to be a polynomial too? If the network has, say, a ReLU layer, then I wouldn’t even expect $L(w)$ to be smooth. And I can’t see any reason to think that $\tanh$ or swishes or whatever else we use would make $L(w)$ happen to be a polynomial either.
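To make the ReLU worry concrete, here is a minimal sketch that is not from the original post. It assumes a hypothetical one-parameter model `relu(w * x)` and uses squared error on a single data point as a stand-in for the negative log-likelihood (up to constants, under Gaussian noise). The names `relu`, `loss`, and `one_sided_slopes` are illustrative only.

```python
def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)


def loss(w, x=1.0, y=1.0):
    """Squared error of the toy model relu(w * x) on one (x, y) pair.

    Assumption: this stands in for L(w); it is not the post's definition.
    """
    return (relu(w * x) - y) ** 2


def one_sided_slopes(f, w0, h=1e-6):
    """Finite-difference slopes of f just to the left and right of w0."""
    left = (f(w0) - f(w0 - h)) / h
    right = (f(w0 + h) - f(w0)) / h
    return left, right


left, right = one_sided_slopes(loss, 0.0)
print(left, right)
```

For `w < 0` the model outputs 0 and the loss is flat, while for small `w > 0` the loss is `(w - 1) ** 2`, so the two one-sided slopes at `w = 0` disagree. The loss has a kink there, which means it is not differentiable at that point, let alone a polynomial.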
- Jesse Hoogland 26 Jan 2023 18:19 UTC
  4 points
  To take a step back, the idea of a Taylor expansion is that we can express any analytic function (a $C^{\infty}$ function that equals its Taylor series near the expansion point) as an (infinite) polynomial. If you’re close enough to the point you’re expanding around, then a finite polynomial can be an arbitrarily good fit.
  The central challenge here is that $K(w)$ is pretty much never a polynomial. So the idea is to find a mapping, $g$, that lets us re-express $w$ in terms of a new coordinate system, $w = g(u)$. If we do this right, then we can express $K(g(u))$ (locally) as a polynomial in terms of the new coordinates, $u$.
  What we’re doing here is “fixing” the non-differentiable singularities in $K(w)$ so that we can do a kind of Taylor expansion over the new coordinates. That’s why we have to introduce this new manifold, $U$, and the mapping $g$.
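The two ideas in this reply can be sketched on a toy function. This is an illustration under stated assumptions, not the construction used in the post: `K` below is a hypothetical one-dimensional stand-in, $K(w) = 1 - \cos(w)$, chosen because it is analytic, not a polynomial, and has minimum value 0 at $w = 0$. It is also a regular (non-singular) case, so the change of coordinates `g` is much simpler than what a real singular model would need.

```python
import math


def K(w):
    """Toy stand-in for K(w): analytic, non-polynomial, K(0) = 0 is the minimum."""
    return 1.0 - math.cos(w)


def taylor_K(w, order):
    """Taylor polynomial of 1 - cos(w) around 0, keeping terms up to w**order."""
    total = 0.0
    k = 1
    while 2 * k <= order:
        total += (-1) ** (k + 1) * w ** (2 * k) / math.factorial(2 * k)
        k += 1
    return total


def g(u):
    """Change of coordinates w = g(u), valid for |u| <= sqrt(2).

    Since 1 - cos(w) = 2 * sin(w / 2) ** 2, choosing u = sqrt(2) * sin(w / 2)
    gives K(g(u)) = u ** 2 exactly.
    """
    return 2.0 * math.asin(u / math.sqrt(2.0))


# A finite Taylor polynomial fits better as the order grows (near w = 0).
err_order_2 = abs(K(0.5) - taylor_K(0.5, 2))
err_order_4 = abs(K(0.5) - taylor_K(0.5, 4))

# In the new coordinate u, the same function is exactly the monomial u ** 2.
u = 0.3
reparam_gap = abs(K(g(u)) - u ** 2)
print(err_order_2, err_order_4, reparam_gap)
```

In the original coordinate, a truncated Taylor polynomial is only ever an approximation to `K`. After the change of coordinates, `K(g(u))` is a polynomial in `u` on the nose. In the singular case the mapping $g$ has to do more work than this, but the goal is the same: a local polynomial form in the new coordinates.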