Jesse Hoogland comments on Neural networks generalize because of this one weird trick

Jesse Hoogland 28 Jan 2023 3:41 UTC
2 points
0
I wrote a follow-up that should help by walking through an example in more detail. The example I mention is the loss function (= potential energy) $L (x, y) = a \cdot \min ((x - b)^{2}, (y - b)^{2})$ . There’s a singularity at the origin (for $b = 0$ ).
This does seem like an important point to emphasize: symmetries in the model $p (\cdot | w)$ (or $f_{w} (\cdot)$ if you’re making deterministic predictions) and the true distribution $q (x)$ lead to singularities in the loss landscape $L_{n} (w)$ . There’s an important distinction between $f$ and $L$ .
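To make the toy loss concrete, here is a minimal Python sketch. It assumes the loss takes both $x$ and $y$ as arguments, with $a = 1$ and $b = 0$ as defaults; the helper `zero_directions` and the step size `eps` are hypothetical names introduced only for illustration. The zero set of this loss is the union of the lines $x = b$ and $y = b$, and the sketch counts how many axis-aligned steps keep the loss at zero, which is one rough way to see that the crossing point $(b, b)$ differs from the other minima.

```python
def loss(x, y, a=1.0, b=0.0):
    """Toy loss L(x, y) = a * min((x - b)^2, (y - b)^2)."""
    return a * min((x - b) ** 2, (y - b) ** 2)


def zero_directions(x, y, a=1.0, b=0.0, eps=1e-3):
    """Count the axis-aligned steps (+x, -x, +y, -y) of size eps
    after which the loss is still exactly zero.
    Hypothetical helper, for illustration only."""
    steps = [(eps, 0.0), (-eps, 0.0), (0.0, eps), (0.0, -eps)]
    return sum(1 for dx, dy in steps if loss(x + dx, y + dy, a, b) == 0.0)


print(zero_directions(0.0, 0.0))  # crossing point of the two lines: 4
print(zero_directions(0.0, 1.0))  # ordinary point on the line x = b: 2
print(zero_directions(1.0, 1.0))  # not a minimum: 0
```

At an ordinary minimum only the two steps along the line stay at zero loss, while at the crossing point all four do.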
- tgb 30 Jan 2023 11:59 UTC
  2 points
  0
  Parent
  So that example is of $L$ ; what is the $f$ for it? Obviously, there are multiple $f$ that could give that (depending on how the loss is computed from $f$ ), with some of them having symmetries and some of them not. That’s why I find the discussion so confusing: we really only care about symmetries of $f$ (which give type B behavior) but instead are talking about symmetries of $L$ (which may indicate either type A or type B) without really distinguishing the two. (Unless my example in the previous post shows that it’s a false dichotomy and type A can simulate type B at a singularity.)
  I’m also not sure the example matches the plots you’ve drawn: presumably the parameters of the model are $a, b$ but the plots show it varying $x, y$ for fixed $a = 1, b = 0$ ? Treating it as written, there’s not actually a singularity in its parameters $a, b$ .
  - Jesse Hoogland 30 Jan 2023 21:31 UTC
    1 point
    0
    Parent
    This is a toy example (I didn’t come up with it with any particular $f$ in mind).
    
    I think the important thing is that the distinction does not make much of a difference in practice. Both correspond to lower effective dimensionality (type A very explicitly, and type B less directly). Both are able to “trap” random motion. And it seems like both somehow help make the loss landscape more navigable.
    
    If you’re interested in interpreting the energy landscape as a loss landscape, $x$ and $y$ would be the parameters (and $a$ and $b$ would be hyperparameters related to things like the learning rate and batch size).