I’m still thinking about this (unsuccessfully). Maybe my missing piece is that the examples I’m considering here still don’t have any of the singularities this topic focuses on! What are the simplest examples with singularities? Say again we’re fitting y = f(x) over some parameters, and specifically let’s take the points (0,0) and (1,0) as our only training data. Then f1(x) = ab + cx has minimal loss set {(a=0 or b=0) and c=0}, which has a singularity at (0,0,0). I don’t really see why it would generalize better than f2(x) = a + cx or f3(x) = a + b + cx, neither of which has a singularity in its minimal loss set. These are still only examples of type B behavior, where the model is already effectively just two parameters, so maybe there’s no further improvement for a singularity to give?
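In case it helps make this concrete, here is a small sympy sketch of what I mean (assuming a plain squared-error loss, which is an assumption on my part, not something specified above): for f1 the zero-loss set is the union of the a-axis and the b-axis in (a, b, c)-space, two lines crossing at the origin, while for f2 the minimum is just the single point a = c = 0.

```python
import sympy as sp

a, b, c = sp.symbols('a b c', real=True)

# Squared-error loss for f1(x) = a*b + c*x on the data points (0, 0) and (1, 0):
# the residuals are f1(0) = a*b and f1(1) = a*b + c.
L = (a*b)**2 + (a*b + c)**2

# The zero-loss set is {a*b = 0 and c = 0}, i.e. the a-axis union the b-axis,
# which cross at (0, 0, 0). Check that both lines give zero loss:
print(L.subs({a: 0, c: 0}))  # 0 along the b-axis
print(L.subs({b: 0, c: 0}))  # 0 along the a-axis

# For f2(x) = a + c*x the residuals are a and a + c, so the only zero-loss
# point is a = c = 0: no crossing, hence no singularity.
```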
Consider instead f4(x) = a + bx + cdx^2. Here the minimal loss set has a singularity at (0,0,0,0). But maybe now, if we’re at that point, the model has effectively reduced to f4(x) = a + bx + 0, since perturbing either c or d alone away from zero still keeps the last term zero. So maybe this is a case where f4 has type A behavior in general (since the x^2 term can throw off generalization compared to a linear model) but approximates type B behavior near the singularity (since the x^2 term stays negligible even when perturbed)? That seems to be the best picture of this argument that I’ve been able to convince myself of so far! Singularities are (sometimes) points where type A behavior becomes type B behavior.
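And a quick numerical sanity check of that perturbation claim (just evaluating f4 directly; the perturbation size is arbitrary):

```python
import numpy as np

def f4(x, a, b, c, d):
    # f4(x) = a + b*x + (c*d)*x^2
    return a + b * x + (c * d) * x ** 2

x = np.linspace(-2.0, 2.0, 5)
eps = 1e-2
base = f4(x, 0.0, 0.0, 0.0, 0.0)  # at the singular point (0, 0, 0, 0)

print(np.max(np.abs(f4(x, 0, 0, eps, 0) - base)))    # perturb c alone: exactly 0
print(np.max(np.abs(f4(x, 0, 0, 0, eps) - base)))    # perturb d alone: exactly 0
print(np.max(np.abs(f4(x, 0, 0, eps, eps) - base)))  # perturb both: only eps^2 * x^2
```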
I wrote a follow-up that should be helpful for seeing an example in more detail. The example I mention is the loss function (= potential energy) L(x, y) = a⋅min((x−b)^2, (y−b)^2). There’s a singularity at the origin.
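For reference, a minimal numpy version of that landscape (with a = 1, b = 0, as in the plots): its zero set is the union of the lines x = b and y = b, which cross at the singular point.

```python
import numpy as np

def L(x, y, a=1.0, b=0.0):
    # L(x, y) = a * min((x - b)^2, (y - b)^2)
    return a * np.minimum((x - b) ** 2, (y - b) ** 2)

# The zero set is {x = b} union {y = b}: two lines crossing at (b, b),
# i.e. at the origin when b = 0.
t = np.linspace(-1.0, 1.0, 201)
print(np.allclose(L(np.zeros_like(t), t), 0.0))  # zero along the line x = 0
print(np.allclose(L(t, np.zeros_like(t)), 0.0))  # zero along the line y = 0
```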
This does seem like an important point to emphasize: symmetries in the model p(⋅|w) (or f_w(⋅) if you’re making deterministic predictions) and the true distribution q(x) lead to singularities in the loss landscape L_n(w). There’s an important distinction between f and L.
So that example is of L; what is the f for it? Obviously there are multiple f that could give that L (depending on how the loss is computed from f), some of them with symmetries and some without. That’s why I find the discussion so confusing: we really only care about symmetries of f (which give type B behavior), but instead we’re talking about symmetries of L (which may indicate either type A or type B) without really distinguishing the two. (Unless my example in the previous post shows that it’s a false dichotomy and type A can simulate type B at a singularity.)
I’m also not sure the example matches the plots you’ve drawn: presumably the parameters of the model are a and b, but the plots show it varying x and y for fixed a=1, b=0? Treating it as written, there’s not actually a singularity in its parameters a and b.
This is a toy example (I didn’t come up with it with any particular f in mind).
I think the important thing is that the distinction does not make much of a difference in practice. Both correspond to lower effective dimensionality (type A very explicitly, type B less directly). Both are able to “trap” random motion. And it seems like both somehow help make the loss landscape more navigable.
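One concrete way to see the lower effective dimensionality in the earlier f1 example (again under my assumption of a squared-error loss): the Hessian of the loss is more degenerate at the singular crossing than at a generic point of the zero-loss set.

```python
import sympy as sp

a, b, c = sp.symbols('a b c', real=True)
# Squared-error loss for f1(x) = a*b + c*x on the data (0, 0) and (1, 0).
L = (a*b)**2 + (a*b + c)**2
H = sp.hessian(L, (a, b, c))

print(H.subs({a: 0, b: 0, c: 0}).rank())  # 1: only one curved direction at the crossing
print(H.subs({a: 1, b: 0, c: 0}).rank())  # 2 at a generic zero-loss point
```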
If you’re interested in interpreting the energy landscape as a loss landscape, x and y would be the parameters (and a and b would be hyperparameters related to things like the learning rate and batch size).
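Taking that interpretation literally, here is a rough sketch of my own (with an arbitrary step size and noise scale standing in for those hyperparameters) that runs noisy gradient descent on the a = 1, b = 0 landscape with x and y as the parameters, as one way to poke at the “trap random motion” point above:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_L(x, y):
    # Gradient of L(x, y) = min(x^2, y^2), taking the y-branch at ties.
    if x ** 2 < y ** 2:
        return np.array([2.0 * x, 0.0])
    return np.array([0.0, 2.0 * y])

pos = np.array([1.5, -1.2])   # start away from the zero-loss set
lr, noise = 1e-2, 5e-2        # stand-ins for learning-rate / batch-size effects
traj = [pos.copy()]
for _ in range(20_000):
    pos = pos - lr * grad_L(*pos) + noise * np.sqrt(lr) * rng.standard_normal(2)
    traj.append(pos.copy())
traj = np.array(traj)

# Fraction of time spent near the zero-loss set {x = 0} ∪ {y = 0},
# and near the singular crossing at the origin:
print((np.minimum(np.abs(traj[:, 0]), np.abs(traj[:, 1])) < 0.1).mean())
print((np.linalg.norm(traj, axis=1) < 0.2).mean())
```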