The things Joar is saying there generally match my intuitions, although there are some things unsaid where my intuitions might diverge more.
There’s about a gazillion measures of “simplicity”. If we pick some simplicity measure M, the usual rule is that things with low M-complexity have low Kolmogorov complexity, but things with low Kolmogorov complexity don’t necessarily have low M-complexity. For instance, if a file can be compressed a lot by a ZIP file encoding, then that file has low Kolmogorov complexity, but not all low Kolmogorov complexity strings can be compressed by ZIP specifically.
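To make that asymmetry concrete, here is a minimal sketch (my own illustration, using zlib as a stand-in for some particular simplicity measure M): a pseudorandom byte string generated from a fixed seed has low Kolmogorov complexity, since the few lines of code below reproduce it exactly, yet zlib cannot compress it at all, whereas a repetitive string scores as simple on both measures.

```python
import random
import zlib

def zlib_ratio(data: bytes) -> float:
    """Compressed size divided by original size; lower means "simpler" according to zlib."""
    return len(zlib.compress(data, 9)) / len(data)

# Repetitive string: low Kolmogorov complexity, and zlib compresses it heavily.
repetitive = b"abc" * 10_000

# Pseudorandom string from a fixed seed: also low Kolmogorov complexity
# (this very snippet regenerates it exactly), but zlib treats it as incompressible.
rng = random.Random(0)
pseudorandom = bytes(rng.randrange(256) for _ in range(30_000))

print(f"repetitive:   {zlib_ratio(repetitive):.3f}")    # roughly 0.00x
print(f"pseudorandom: {zlib_ratio(pseudorandom):.3f}")  # roughly 1.0 (slightly above, due to overhead)
```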
In this context, smoothness is one such relevant measure: smooth functions have low Kolmogorov complexity, but there are other ways to have low Kolmogorov complexity without being smooth. I don’t know about the Levin bound specifically, but in math these sorts of theorems are usually about smoothness. In complexity theory in particular, theorems often connect smoothness measures to circuit size measures (which are another class of complexity measures which imply low Kolmogorov complexity but not vice versa).
Roughly speaking, if a complexity measure were as general as Kolmogorov complexity, then we’d expect it to be uncomputable. If we can actually find low-complexity functions under a complexity measure M, then that complexity measure is probably less general than Kolmogorov complexity. From there, the natural next question is exactly what generality is kept/lost, and whether the loss of generality actually matters for things in the real world.
Ah, I certainly agree with this.
I do not wish to claim that all functions with low Kolmogorov complexity have large volumes in the parameter-space of a sufficiently large neural network. In fact, I can point to several concrete counterexamples to this claim. To give one example, the identity function certainly has a low Kolmogorov complexity, but it’s very difficult for a (fully connected feed-forward) neural network to learn this function (if the input and output are represented in binary form as bit strings). If you try to learn this function by training only on odd numbers, then the network will not robustly generalise to even numbers (or vice versa). Similarly, if you train using only numbers in a certain range, then the network will not robustly generalise outside this range. This is because a pattern such as “the n’th input neuron is equal to the n’th output neuron” lacks a simple representation in a neural network (and hence this function has a small parameter-space volume, even though it has low Kolmogorov complexity). The same goes for the function that recognises palindromes, and so on.
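As a rough sketch of the kind of experiment described here (the architecture, hyperparameters, and helper names are my own arbitrary choices, purely for illustration): train a small fully connected network to copy an 8-bit input to its output, but only ever show it odd numbers, then check the held-out even numbers. One would expect the even-number bit accuracy to fall noticeably short of perfect, since in particular the lowest bit is constant in the training data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N_BITS = 8

def to_bits(numbers):
    """Integers -> float tensor of their N_BITS-bit binary representations."""
    return torch.tensor([[(n >> i) & 1 for i in range(N_BITS)] for n in numbers],
                        dtype=torch.float32)

train_x = to_bits(range(1, 256, 2))  # odd numbers only
test_x = to_bits(range(0, 256, 2))   # held-out even numbers

# A plain fully connected network, easily big enough to represent the identity on 8 bits.
model = nn.Sequential(nn.Linear(N_BITS, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, N_BITS))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(train_x), train_x)  # target = input, i.e. the identity function
    loss.backward()
    opt.step()

def bit_accuracy(x):
    with torch.no_grad():
        return ((model(x) > 0).float() == x).float().mean().item()

print(f"train (odd) bit accuracy: {bit_accuracy(train_x):.3f}")  # close to 1.0
print(f"test (even) bit accuracy: {bit_accuracy(test_x):.3f}")   # systematically below 1.0
```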
So, I agree that there are certain functions with low Kolmogorov complexity that a neural network normally cannot “see” properly. I also think one could frame a lot of the research on developing new neural network architectures as being about making neural networks able to “see” more kinds of functions. For example, NALUs (Neural Arithmetic Logic Units) give neural networks the ability to “see” arithmetic relations more easily. I hence certainly think it’s a very relevant question which complexity measure best describes the bias in neural networks (and I think this actually matters for practical problems). Note that the identity function is very smooth.
This is a bit of a tangent, but the Levin bound is actually about Kolmogorov complexity. It’s a fairly simple theorem; the proof is constructive, and basically shows that a given function f which corresponds to many parameters in the parameter-space cannot be too complex, by constructing a simple program which computes f. Very roughly, if the parameter-space is finite and discrete, then we could construct a Huffman code for the function space (where the distribution over the function-space is the distribution that corresponds to the uniform distribution over the parameter-space). We can then make a computer program that computes f by concatenating the Huffman code of f and the parameter-function map m (which gives an upper bound on the Kolmogorov complexity of functions with large volumes). Of course, this theorem does not per se actually apply to neural networks, since it assumes that the parameter-space is finite and discrete, so in this context it’s essentially just an intuition pump.
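To spell the bound out (this is just my rendering of the construction above in symbols, with the usual hand-waving about additive constants; the notation $\Theta$, $m$, $P$, $K$ is mine): for a finite, discrete parameter-space $\Theta$ with parameter-function map $m$, let

$$P(f) \;=\; \frac{|\{\theta \in \Theta : m(\theta) = f\}|}{|\Theta|}.$$

The Huffman code for the distribution $P$ assigns $f$ a codeword of length at most $-\log_2 P(f) + 1$, and a program that bundles this codeword with a description of $m$ (from which the code can be reconstructed) computes $f$, so

$$K(f) \;\le\; -\log_2 P(f) + K(m) + O(1).$$

In other words, any function that occupies a large fraction of the parameter-space must have low Kolmogorov complexity, with the constant depending only on the map $m$.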
That’s a clever example, I like it.
Based on that description, it should be straightforward to generalize the Levin bound to neural networks. The main step would be to replace the Huffman code with a turbocode (or any other near-Shannon-bound code), at which point the compressibility is basically identical to the log probability density, and we can take the limit to continuous function space without any trouble. The main change is that entropy would become relative entropy (as is normal when taking info theory bounds to a continuous limit). Intuitively, it’s just using the usual translation between probability theory and minimum description length, and applying it to the probability density of parameter space.
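Written out, the correspondence being invoked here is the standard one between probability and minimum description length: a code that achieves the Shannon bound for a distribution assigns each function a codeword of length about $-\log_2 \Pr(f)$ bits, so the bound above becomes, very schematically (and glossing over the same measure-theoretic details as the comment does),

$$K(f) \;\lesssim\; -\log_2 p(f) + \text{const},$$

with $p$ now the density over functions induced by pushing the parameter distribution through the parameter-function map. In the continuous limit the expected code length is no longer an absolute entropy but a relative entropy against whichever reference measure is used to define the density, which is the change noted above.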
Thanks, that helps. Perhaps an example would be: A purely feed-forward neural network might be “blind” to algorithms that are Kolmogorov-simple but which involve repeatedly performing the same procedure a bunch of times (even if it is technically big enough to contain such an algorithm). So the simplicity bias of said network would be importantly different from Kolmogorov complexity.
That’s exactly right. That exact example would be a case of high circuit size but low Kolmogorov complexity.