Jesse Hoogland comments on My Criticism of Singular Learning Theory

Jesse Hoogland 21 Nov 2023 22:35 UTC
LW: 5 AF: 2
2
AF
The easiest way to explain why this is the case will probably be to provide an example. Suppose we have a Bayesian learning machine with 15 parameters, whose parameter-function map is given by

$f (x) = θ_{1} + θ_{2} θ_{3} x + θ_{4} θ_{5} θ_{6} x^{2} + θ_{7} θ_{8} θ_{9} θ_{10} x^{3} + θ_{11} θ_{12} θ_{13} θ_{14} θ_{15} x^{4},$
and whose loss function is the KL divergence. This learning machine will learn 4-degree polynomials.
I’m not sure, but I think this example is pathological. One possible reason for this to be the case is that the symmetries in this model are entirely “generic” or “global.” The more interesting kinds of symmetry are “nongeneric” or “local.”

What I mean by “global” is that each point $θ = {θ_{i}}_{i = 1}^{15}$ in the parameter space $Θ$ has the same set of symmetries (specifically, the product of a bunch of hyperboloids $θ_{2} θ_{3} = c_{1}, θ_{4} θ_{5} θ_{6} = c_{2}, \dots$ ). In neural networks there are additional symmetries that are only present for a subset of the weights. My favorite example of this is the decision boundary annihilation (see below).
For the sake of simplicity, consider a ReLU network learning a 1D function (which is just piecewise linear approximation). Consider what happens when you you rotate two adjacent pieces so they end up sitting on the same line, thereby “annihilating” the decision boundary between them, so this now-hidden decision boundary no longer contributes to your function. You can move this decision boundary along the composite linear piece without changing the learned function, but this only holds until you reach the next decision boundary over. I.e.: this symmetry is local. (Note that real-world networks actually seem to take advantage of this property.)
This is the more relevant and interesting kind of symmetry, and it’s easier to see what this kind of symmetry has to do with functional simplicity: simpler functions have more local degeneracies. We expect this to be true much more generally — that algorithmic primitives like conditional statements, for loops, composition, etc. have clear geometric traces in the loss landscape.
So what we’re really interested in is something more like the relative RLCT (to the model class’s maximum RLCT). This is also the relevant quantity from a dynamical perspective: it’s relative loss and complexity that dictate transitions, not absolute loss or complexity.
This gets at another point you raised:
2. It is a type error to describe a function as having low RLCT. A given function may have a high RLCT or a low RLCT, depending on the architecture of the learning machine.
You can make the same critique of Kolmogorov complexity. Kolmogorov complexity is defined relative to some base UTM. Fixing a UTM lets you set an arbitrary constant correction. What’s really interesting is the relative Kolmogorov complexity.
In the case of NNs, the model class is akin to your UTM, and, as you show, you can engineer the model class (by setting generic symmetries) to achieve any constant correction to the model complexity. But those constant corrections are not the interesting bit. The interesting bit is the question of relative complexities. I expect that you can make a statement similar to the equivalence-up-to-a-constant of Kolmogorov complexity for RLCTs. Wild conjecture: given two model classes $W_{1}$ and $W_{2}$ and some true distribution $q (x)$ , their RLCTs obey:
$- c < f (λ_{1} (q)) < f (λ_{2} (q)) < c,$
where $f$ is some monotonic function.
- Joar Skalse 22 Nov 2023 11:21 UTC
  LW: 3 AF: 2
  0
  AF Parent
  I’m not sure, but I think this example is pathological.
  Yes, it’s artificial and cherry-picked to make a certain rhetorical point as simply as possible.
  This is the more relevant and interesting kind of symmetry, and it’s easier to see what this kind of symmetry has to do with functional simplicity: simpler functions have more local degeneracies.¨
  This is probably true for neural networks in particular, but mathematically speaking, it completely depends on how you parameterise the functions. You can create a parameterisation in which this is not true.
  You can make the same critique of Kolmogorov complexity.
  Yes, I have been using “Kolmogorov complexity” in a somewhat loose way here.
  Wild conjecture: [...]
  Is this not satisfied trivially due to the fact that the RLCT has a certain maximum and minimum value within each model class? (If we stick to the assumption that $Θ$ is compact, etc.)
  - Jesse Hoogland 22 Nov 2023 21:50 UTC
    LW: 4 AF: 2
    0
    AF Parent
    This is probably true for neural networks in particular, but mathematically speaking, it completely depends on how you parameterise the functions. You can create a parameterisation in which this is not true.
    Agreed. So maybe what I’m actually trying to get at it is a statement about what “universality” means in the context of neural networks. Just as the microscopic details of physical theories don’t matter much to their macroscopic properties in the vicinity of critical points (“universality” in statistical physics), just as the microscopic details of random matrices don’t seem to matter for their bulk and edge statistics (“universality” in random matrix theory), many of the particular choices of neural network architecture doesn’t seem to matter for learned representations (“universality” in DL).
    What physics and random matrix theory tell us is that a given system’s universality class is determined by its symmetries. (This starts to get at why we SLT enthusiasts are so obsessed with neural network symmetries.) In the case of learning machines, those symmetries are fixed by the parameter-function map, so I totally agree that you need to understand the parameter-function map.
    However, focusing on symmetries is already a pretty major restriction. If a universality statement like the above holds for neural networks, it would tell us that most of the details of the parameter-function map are irrelevant.
    There’s another important observation, which is that neural network symmetries leave geometric traces. Even if the RLCT on its own does not “solve” generalization, the SLT-inspired geometric perspective might still hold the answer: it should be possible to distinguish neural networks from the polynomial example you provided by understanding the geometry of the loss landscape. The ambitious statement here might be that all the relevant information you might care about (in terms of understanding universality) are already contained in the loss landscape.
    If that’s the case, my concern about focusing on the parameter-function map is that it would pose a distraction. It could miss the forest for the trees if you’re trying to understand the structure that develops and phenomena like generalization. I expect the more fruitful perspective to remain anchored in geometry.
    Is this not satisfied trivially due to the fact that the RLCT has a certain maximum and minimum value within each model class? (If we stick to the assumption that $Θ$ is compact, etc.)
    Hmm, maybe restrict $f$ so it has to range over $R$ .
    - Joar Skalse 24 Nov 2023 23:11 UTC
      LW: 1 AF: 1
      0
      AF Parent
      If a universality statement like the above holds for neural networks, it would tell us that most of the details of the parameter-function map are irrelevant.
      I suppose this depends on what you mean by “most”. DNNs and CNNs have noticeable and meaningful differences in their (macroscopic) generalisation behaviour, and these differences are due to their parameter-function map. This is also true of LSTMs vs transformers, and so on. I think it’s fairly likely that these kinds of differences could have a large impact on the probability that a given type model will learn to exhibit goal-directed behaviour in a given training setup, for example.
      The ambitious statement here might be that all the relevant information you might care about (in terms of understanding universality) are already contained in the loss landscape.
      Do you mean the loss landscape in the limit of infinite data, or the loss landscape for a “small” amount of data? In the former case, the loss landscape determines the parameter-function map over the data distribution. In the latter case, my guess would be that the statement probably is false (though I’m not sure).