I do not wish to claim that all functions with low Kolmogorov complexity have large volumes in the parameter space of a sufficiently large neural network. In fact, I can point to several concrete counterexamples to this claim. To give one example, the identity function certainly has low Kolmogorov complexity, but it’s very difficult for a (fully connected feed-forward) neural network to learn this function (if the input and output are represented in binary form as bit strings). If you try to learn this function by training on only odd numbers, then the network will not robustly generalise to even numbers (or vice versa). Similarly, if you train using only numbers in a certain range, then the network will not robustly generalise outside this range. This is because a pattern such as “the n’th input neuron is equal to the n’th output neuron” lacks a simple representation in a neural network (and hence this function has a small parameter-space volume, even though it has low Kolmogorov complexity). The same goes for the function that recognises palindromes, and so on.
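To make the failure concrete, here is a minimal sketch of that experiment (my own illustration, assuming PyTorch and 8-bit inputs, not something from the original discussion): train a fully connected network to copy its input bit string, but only show it odd numbers, then test on even numbers.

```python
import torch
import torch.nn as nn

BITS = 8

def to_bits(n):
    # Little-endian binary representation of n as a float tensor.
    return torch.tensor([(n >> i) & 1 for i in range(BITS)], dtype=torch.float32)

# Train only on odd numbers; hold out all even numbers.
train_x = torch.stack([to_bits(n) for n in range(1, 256, 2)])
test_x = torch.stack([to_bits(n) for n in range(0, 256, 2)])

# A plain fully connected network that should, ideally, learn the identity.
model = nn.Sequential(nn.Linear(BITS, 64), nn.ReLU(), nn.Linear(64, BITS))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(train_x), train_x)  # target = input, i.e. the identity
    loss.backward()
    opt.step()

def bit_accuracy(x):
    with torch.no_grad():
        return ((model(x) > 0).float() == x).float().mean().item()

print("train (odd) bit accuracy:", bit_accuracy(train_x))
print("test (even) bit accuracy:", bit_accuracy(test_x))
```

One would expect this to fail at least on the lowest-order bit, which is constant across the whole training set: nothing pushes the network to represent “copy bit 0” rather than “always output 1 for bit 0”.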
So, I agree that there are certain functions with low Kolmogorov complexity that a neural network normally cannot “see” properly. I also think one could frame a lot of the research on developing new neural network architectures as being about making neural networks able to “see” more kinds of functions. For example, NALUs give neural networks the ability to “see” arithmetic relations more easily. I therefore certainly think that the question of which complexity measure best describes the bias in neural networks is very relevant (and I think this actually matters for practical problems). Note that the identity function is very smooth.
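For reference, the NALU mentioned above is the Neural Arithmetic Logic Unit of Trask et al. (2018). A minimal sketch of the cell (my own paraphrase of the paper, not the canonical implementation) looks like this:

```python
import torch
import torch.nn as nn

class NALU(nn.Module):
    """Sketch of a Neural Arithmetic Logic Unit (Trask et al., 2018)."""
    def __init__(self, in_dim, out_dim, eps=1e-7):
        super().__init__()
        # W = tanh(W_hat) * sigmoid(M_hat) saturates towards {-1, 0, 1},
        # which is what lets the unit "see" exact arithmetic relations.
        self.W_hat = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)
        self.M_hat = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)
        self.G = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)
        self.eps = eps

    def forward(self, x):
        W = torch.tanh(self.W_hat) * torch.sigmoid(self.M_hat)
        a = x @ W.t()                                         # additive path
        m = torch.exp(torch.log(x.abs() + self.eps) @ W.t())  # multiplicative path (via log space)
        g = torch.sigmoid(x @ self.G.t())                     # learned gate between the two
        return g * a + (1 - g) * m
```

The point is that addition and multiplication become representable with near-discrete weights, so these relations are “visible” to gradient descent in a way they are not in a standard layer.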
This is a bit of a tangent, but the Levin bound is actually about Kolmogorov complexity. It’s a fairly simple theorem; the proof is constructive, and essentially shows that a function f which corresponds to many parameters in the parameter space cannot be too complex, by constructing a simple program that computes f. Very roughly: if the parameter space is finite and discrete, we can construct a Huffman code for the function space (where the distribution over the function space is the one induced by the uniform distribution over the parameter space). We can then make a computer program that computes f by concatenating the Huffman codeword of f with a description of the parameter-function map m, which gives an upper bound on the Kolmogorov complexity of functions with large volumes. Of course, this theorem does not strictly apply to neural networks, since it assumes that the parameter space is finite and discrete, so in this context it’s essentially just an intuition pump.
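Here is a toy illustration of that construction (entirely made up by me for this comment): a tiny “network” with three parameters, each in {-1, 0, 1}, whose induced functions are truth tables over two binary inputs. High-volume functions get short Huffman codewords, matching -log₂ of their probability to within about a bit.

```python
import heapq
import math
from collections import Counter
from itertools import product

# Toy parameter-function map m: two weights and a bias, each in {-1, 0, 1};
# the induced function is a truth table over the four possible binary inputs.
def m(params):
    w1, w2, b = params
    return tuple(int(w1 * x1 + w2 * x2 + b > 0)
                 for x1, x2 in product([0, 1], repeat=2))

param_space = list(product([-1, 0, 1], repeat=3))
volumes = Counter(m(p) for p in param_space)  # parameter-space volume of each function
probs = {fn: v / len(param_space) for fn, v in volumes.items()}

def huffman_lengths(probs):
    # Standard Huffman construction, tracking only codeword lengths.
    heap = [(p, i, {fn: 0}) for i, (fn, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {fn: d + 1 for fn, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

lengths = huffman_lengths(probs)
for fn, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(fn, f"volume={p:.3f}", f"codeword length={lengths[fn]}",
          f"-log2 p={-math.log2(p):.2f}")
```

A program that decodes a codeword and then applies m computes the corresponding function, so the Kolmogorov complexity of a function is bounded by its codeword length plus the (fixed) description length of the decoder and m.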
Based on that description, it should be straightforward to generalize the Levin bound to neural networks. The main step would be to replace the Huffman code with a turbo code (or any other code operating near the Shannon limit), at which point the compressibility is essentially the same as the log probability density, and we can take the limit to a continuous function space without any trouble. The main change is that entropy becomes relative entropy (as is normal when taking information-theoretic bounds to a continuous limit). Intuitively, this is just the usual translation between probability theory and minimum description length, applied to the probability density over parameter space.
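Spelled out, the discrete bound and the continuous analogue being proposed would look something like this (my paraphrase; the second line is a heuristic, not a theorem):

```latex
% Discrete case: P(f) is the probability of f under the uniform
% distribution on the finite parameter space; decoding the Huffman
% codeword of f and applying the map m yields a program for f, so
\[
  K(f) \;\le\; -\log_2 P(f) + K(m) + O(1).
\]
% Continuous heuristic: with a near-Shannon-limit code, codeword length
% tracks the log of the density p(f) relative to a reference density
% q(f), so entropy is replaced by relative entropy:
\[
  K(f) \;\lesssim\; -\log_2 \frac{p(f)}{q(f)} + K(m) + O(1).
\]
```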
Ah, I certainly agree with this.
That’s a clever example, I like it.