Joar Skalse comments on My Criticism of Singular Learning Theory

Joar Skalse 20 Nov 2023 23:48 UTC
6 points
3
A few things:

1. Neural networks do typically learn functions with low Kolmogorov complexity (otherwise they would not be able to generalise well).
2. It is a type error to describe a function as having low RLCT. A given function may have a high RLCT or a low RLCT, depending on the architecture of the learning machine.
3. The critique is against the supposition that we can use SLT to explain why neural networks generalise well in the small-data regime. The example provides a learning machine which would not generalise well, but which does satisfy all assumptions made by SLT. Hence, the SLT theorems which appear to prove that learning machines will generalise well when they are subject to the assumptions of SLT must in fact be showing something else.

My point is precisely that SLT does not give us a predictive account of how neural networks behave, in terms of generalisation and inductive bias, because it abstracts away from factors which are necessary to understand generalisation and inductive bias.
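Point 2 can be illustrated with a toy calculation. This is only a sketch under assumptions that are not stated in the comment itself: two hypothetical one-parameter regression models, unit-variance Gaussian noise, inputs with E[x^2] = 1, and a prior density φ that is smooth and positive at w = 0. Both models realise the same target function f_0(x) = 0 at w = 0, yet the RLCT λ differs between them.

```latex
% Hypothetical models (illustrative, not from the post):
%   Model A: f_w(x) = w x        Model B: f_w(x) = w^2 x
% Both represent the target f_0(x) = 0 at w = 0.
\begin{align*}
K_A(w) &= \tfrac{1}{2}\,\mathbb{E}\!\left[(w x)^2\right] = \tfrac{1}{2}\,w^{2},
&
K_B(w) &= \tfrac{1}{2}\,\mathbb{E}\!\left[(w^{2} x)^2\right] = \tfrac{1}{2}\,w^{4}, \\[4pt]
\zeta_A(z) &= \int K_A(w)^{z}\,\varphi(w)\,dw,
&
\zeta_B(z) &= \int K_B(w)^{z}\,\varphi(w)\,dw, \\[4pt]
\text{largest pole of } \zeta_A &:\ z = -\tfrac{1}{2}
\ \Rightarrow\ \lambda_A = \tfrac{1}{2},
&
\text{largest pole of } \zeta_B &:\ z = -\tfrac{1}{4}
\ \Rightarrow\ \lambda_B = \tfrac{1}{4}.
\end{align*}
% The poles follow from \int_{-\epsilon}^{\epsilon} |w|^{2kz}\,dw
%   = 2\,\epsilon^{2kz+1} / (2kz+1), which is singular at z = -1/(2k).
```

In this toy case the same function is assigned λ = 1/2 by one parametrisation and λ = 1/4 by the other, so the λ log n term in the asymptotic free energy differs even though the function being represented does not.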
- Daniel Murfet 21 Nov 2023 9:02 UTC
  4 points
  1
  I don’t understand the strong link between Kolmogorov complexity and generalisation you’re suggesting here. I think by “generalisation” you must mean something more than “low test error”. Do you mean something like “out of distribution” generalisation (whatever that means)?
  - Roman Leventov 21 Nov 2023 22:06 UTC
    3 points
    0
    Kolmogorov complexity is definitely a misleading path here, and it’s unfortunate that Joar chose it as the “leading” example of complexity in the post. Note this passage:
    However, they do not give a detailed answer to the question of precisely which complexity measure they minimise—they merely show that this result holds for many different complexity measures. For example, I would expect that fully connected neural networks are biased towards functions with low Boolean circuit complexity, or something very close to that. Verifying this claim, and deriving similar results about other kinds of network architectures, would make it easier to reason about what kinds of functions we should expect a neural network to be likely or unlikely to learn. This would also make it easier to reason about out-of-distribution generalisation, etc.
    This quote from the above comment is better:
    If we want to explain generalisation in neural networks, then we must explain if and how their inductive bias aligns with our (human) priors. Moreover, our human priors are (in most contexts) largely captured by computational complexity. Therefore, we must somewhere, in some way, connect neural networks to computational complexity.
    I’ve expressed this idea with some links here:
    Bayesian Brain theorists further hypothesise that animal brains do effectively implement something like these “simple” algorithms (adjusted to the level of generality and sophistication of the world model each animal species needs) due to the strong evolutionary pressure on energy efficiency of the brain (“The free energy principle induces neuromorphic development”). The speed-accuracy tradeoffs in brain hardware add another kind of pressure that points in the same direction (“Internal feedback in the cortical perception–action loop enables fast and accurate behavior”).
    Then if we combine two claims:
    Joar’s “DNNs are (kind of) Bayesian” (for reasons that I don’t understand because I haven’t read his papers, so I just take his word here), and
    Fields et al.’s “brains are ‘almost’ Bayesian because Bayesian learning is information-efficient (= energy-efficient), and there is a strong evolutionary pressure for brains in animals to be energy-efficient”,
    is this an explanation of DNNs’ remarkable generalisation ability? Or should more quantification be added to both of these claims to turn this into a good explanation?
    - Joar Skalse 22 Nov 2023 10:19 UTC
      1 point
      0
      Thank you for this. I agree with what you are saying here. In the post, I went with a somewhat loose equivocation between “good priors” and “a prior towards low Kolmogorov complexity”, but this does skim past a lot of nuance. I also very much do not want to say that the DNN prior is exactly towards low Kolmogorov complexity (this would be uncomputable), but only that it is mostly correlated with Kolmogorov complexity for typical problems.
  - Joar Skalse 22 Nov 2023 10:09 UTC
    1 point
    0
    Yes, I mostly just mean “low test error”. I’m assuming that real-world problems follow a distribution that is similar to the Solomonoff prior (i.e., that data generating functions are more likely to have low Kolmogorov complexity than high Kolmogorov complexity) -- this is where the link is coming from. This is an assumption about the real world, and not something that can be established mathematically.
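    As a rough sketch, the assumption can be written as follows. The symbols P (the real-world distribution over data-generating functions), K (prefix Kolmogorov complexity) and m (the universal discrete semimeasure) are illustrative notation, not taken from the thread.

    ```latex
    % Assumption about the world (not a theorem): real-world tasks are drawn
    % from a distribution P that roughly tracks a simplicity prior.
    \[
      P(f) \;\approx\; c \cdot 2^{-K(f)} \quad \text{for some constant } c > 0 .
    \]
    % For comparison, Levin's coding theorem relates the universal
    % semimeasure m to prefix complexity K:
    \[
      -\log_2 m(f) \;=\; K(f) + O(1).
    \]
    ```

    Under this reading, a learner whose inductive bias favours low-complexity functions puts most of its prior mass where P does, which is the sense in which low complexity is linked to low test error here.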