The more important aim of this conversion is that the minima of the term in the exponent, K(w), are now equal to 0, so finding them means finding the zeros of K(w). If we manage to find a way to express K(w) as a polynomial, this lets us pull in the powerful machinery of algebraic geometry, which studies the zeros of polynomials. We’ve turned our problem of probability theory and statistics into a problem of algebra and geometry.
Wait… but K(w) just isn’t a polynomial most of the time. Right? From its definition above, K(w) differs by a constant from the log-likelihood L(w). So the log-likelihood has to be a polynomial too? If the network has, say, a ReLU layer, then I wouldn’t even expect L(w) to be smooth. And I can’t see any reason to think that tanh or swishes or whatever else we use would make L(w) happen to be a polynomial either.
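To make the ReLU worry concrete, here is a minimal sketch. Everything in it is an illustrative assumption rather than anything from the setup above: the one-parameter model f(x; w) = relu(w * x), the single data point (x, y) = (1, 1), and the helper names `toy_loss` and `one_sided_slopes` are all made up for this example. The squared error stands in for a negative log-likelihood (as it would under Gaussian noise, up to constants). The point is only that the one-sided slopes of the loss disagree at w = 0, so the loss has a kink there and cannot be a polynomial in w.

```python
def relu(z):
    return max(z, 0.0)

def toy_loss(w, x=1.0, y=1.0):
    # Hypothetical toy loss: squared error of the model f(x; w) = relu(w * x)
    # on a single made-up data point (x, y).
    return (relu(w * x) - y) ** 2

def one_sided_slopes(f, w, h=1e-6):
    # Finite-difference estimates of the left and right derivatives of f at w.
    left = (f(w) - f(w - h)) / h
    right = (f(w + h) - f(w)) / h
    return left, right

# At the kink w = 0 the two slopes disagree (flat on the left, steep on the right).
print(one_sided_slopes(toy_loss, 0.0))

# Away from the kink, e.g. at w = 0.5, they agree up to finite-difference error.
print(one_sided_slopes(toy_loss, 0.5))
```

For w < 0 this loss is constant, and for w > 0 it is (w - 1)^2, so the left slope at 0 is 0 while the right slope is roughly -2. A polynomial is differentiable everywhere, so no polynomial can match this.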