More explanation on why I’m not summarizing the paper about Kolmogorov complexity:
I don’t find the Kolmogorov complexity stuff very convincing. In general, algorithmic information theory tends to have arguments of the form “since this measure simulates every computable process, it (eventually) at least matches any particular computable process”. This feels pretty different from normal notions of “simplicity” or “intelligence”, so I often try to specifically taboo phrases like “Solomonoff induction” or “Kolmogorov complexity”, replace them with something like “by simulating every possible computational process”, and see whether the argument still seems convincing. Here, it mostly doesn’t.
If I try to do this with the arguments here, I get something like:
Since it is possible to compress high-probability events using an optimal code for the probability distribution, you might expect that functions with high probability in the neural network prior can be compressed more than functions with low probability. Since high probability functions are more likely, this means that the more likely functions correspond to shorter programs. Since shorter programs are necessarily more likely in the prior that simulates all possible programs, they should be expected to be better programs, and so generalize well.
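Spelled out in symbols (my own rendering, with all additive constants suppressed), the two facts this leans on are the source-coding relation and the coding theorem:

$$\ell_P(f) \approx -\log_2 P(f) \qquad \text{and} \qquad m(f) = \sum_{p \,:\, U(p) = f} 2^{-|p|} = 2^{-K(f) + O(1)},$$

where $\ell_P(f)$ is the length of an optimal code for $f$ under the prior $P$, $U$ is a universal prefix machine, $m$ is the universal a priori probability, and $K$ is prefix Kolmogorov complexity. The first relation is read as “likely under the neural network prior, therefore a short description”, and the second as “short description, therefore likely under the prior that simulates all possible programs”.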
This argument just doesn’t sound very compelling to me. It can also be applied to literally any machine learning algorithm; I don’t see why this is specific to neural nets. If this is just meant to explain why it is okay to overparameterize neural nets, then that makes more sense to me, though then I’d say something like “with overparameterized neural nets, many different parameterizations instantiate the same function, and so the ‘effective parameterization’ is lower than you might have thought”, rather than saying anything about Kolmogorov complexity.
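To illustrate what I mean by redundant parameterizations, here is a minimal sketch (a hypothetical two-layer ReLU net of my own, not anything from the paper): permuting the hidden units gives a different point in parameter space but exactly the same function.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny one-hidden-layer ReLU network: 4 inputs, 5 hidden units, scalar output.
W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=5)
w2, b2 = rng.normal(size=5), rng.normal()

def net(x, W1, b1, w2, b2):
    return np.maximum(x @ W1 + b1, 0) @ w2 + b2

perm = np.array([1, 2, 3, 4, 0])   # cyclic shift of the hidden units
x = rng.normal(size=(10, 4))       # arbitrary test inputs

# Different parameters (weights reordered), identical outputs on every input.
print(np.allclose(net(x, W1, b1, w2, b2),
                  net(x, W1[:, perm], b1[perm], w2[perm], b2)))  # True
```

The same goes for the rescaling symmetries of ReLU units, so many parameter settings collapse onto each function.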
(The ‘effective parameterization’ framing doesn’t capture everything that simplicity is meant to capture. For example, it doesn’t capture the argument that neural networks can express models that overfit the training data, but that those models are high complexity, hence low likelihood in the prior, and so don’t happen in general; as mentioned above, though, I find the Kolmogorov complexity argument for this pretty tenuous.)
This part of the argument is indeed quite general, but it’s not vacuous. For the argument to apply, you need the function space to be overparameterised, the parameter-function map to have low complexity relative to the functions in the function space, and the parameter-function map to be biased in some direction. This will not be the case for all learning algorithms.
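To make “biased in some direction” concrete, here is a minimal sketch (a toy set-up of my own, not taken from the paper) of estimating how unevenly a random parameter-function map spreads its probability mass over Boolean functions:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
n_inputs = 3
# All 2^3 = 8 binary inputs, one per row.
X = np.array([[int(b) for b in f"{i:0{n_inputs}b}"] for i in range(2 ** n_inputs)])

def sample_function(width=16):
    """Sample random weights for a one-hidden-layer ReLU net and return the
    Boolean function it computes, encoded by its outputs on all 8 inputs."""
    W1 = rng.normal(size=(n_inputs, width))
    b1 = rng.normal(size=width)
    w2 = rng.normal(size=width)
    b2 = rng.normal()
    logits = np.maximum(X @ W1 + b1, 0) @ w2 + b2
    return tuple((logits > 0).astype(int))

n_samples = 100_000
counts = Counter(sample_function() for _ in range(n_samples))
top5 = sum(c for _, c in counts.most_common(5))
print(f"distinct functions seen: {len(counts)} of {2 ** (2 ** n_inputs)} possible")
print(f"fraction of samples on the 5 most common functions: {top5 / n_samples:.2f}")
```

If the map is biased, a small number of functions absorb most of the sampled mass, and it is that concentration (rather than anything about Kolmogorov complexity per se) that the argument needs.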
But I do agree that the Levin bound argument doesn’t really capture the “mechanism” of what’s going on here. I can of course only speak for myself (and not the other people involved with this work), but I think of the Levin bound argument as essentially a more formal way to state this intuition, i.e. as a loose argument for why we needn’t be surprised that simple functions have larger measure in parameter-space, even if it doesn’t identify the precise details of how this works in any particular case. For example, while the argument doesn’t apply to all ML algorithms, it applies to all neural network architectures in exactly the same way, even though there clearly are important differences in the inductive biases of, e.g., CNNs and RNNs.
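For reference, the bound I have in mind is roughly of the form

$$P(f) \le 2^{-K(f) + O(1)},$$

where $P(f)$ is the probability that randomly sampled parameters produce the function $f$ and $K$ is (a suitable approximation of) prefix Kolmogorov complexity; the constant absorbs the complexity of the parameter-function map itself, which is why the map being simple matters.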
(Also: I don’t think the notion of “low effective parameterisation” really captures what’s going on here, for the reason you point out yourself.)
I agree it’s not vacuous. It sounds like you’re mostly stating the same argument I gave but in different words. Can you tell me what’s wrong or missing from my summary of the argument?
Since it is possible to compress high-probability events using an optimal code for the probability distribution, you might expect that functions with high probability in the neural network prior can be compressed more than functions with low probability. Since high probability functions are more likely, this means that the more likely functions correspond to shorter programs. Since shorter programs are necessarily more likely in the prior that simulates all possible programs, they should be expected to be better programs, and so generalize well.
(Even if you are talking about the overparameterized case, where this argument is not vacuous and also applies primarily to neural nets and not other ML models, I don’t find this argument very compelling a priori, though I agree that based on empirical evidence it is probably mostly correct.)
I agree with your summary. I’m mainly just clarifying my view of the strength and overall role of the algorithmic information theory (AIT) arguments, since you said you found them unconvincing.
I do, however, disagree that those arguments can be applied to “literally any machine learning algorithm”, although they certainly do apply to a much larger class of ML algorithms than just neural networks. I also don’t think this generality is necessarily a bad thing. The picture the AIT arguments give makes it reasonably unsurprising that you would get the double-descent phenomenon as you increase the size of a model: at small sizes, VC-dimension-style mechanisms dominate, but at larger sizes the overparameterisation starts to induce a simplicity bias, which eventually comes to dominate. Since you get double descent in model size for both neural networks and, e.g., random forests, you should expect there to be some mechanism in common between them (even if the details of course differ from case to case).
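As a toy picture of the shape I have in mind (a random-features stand-in of my own, not from the paper, and neither a neural network nor a random forest): minimum-norm least squares on random ReLU features typically shows this rise and fall in test error as the number of features passes the interpolation threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 10, 50, 1000
w_true = rng.normal(size=d)

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.5 * rng.normal(size=n)   # noisy linear target
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

# Sweep the number of random ReLU features through the interpolation
# threshold (n_features = n_train) and record the test error.
for n_features in [10, 25, 45, 50, 55, 100, 400, 2000]:
    W = rng.normal(size=(d, n_features))        # fixed random first layer
    phi_tr = np.maximum(X_tr @ W, 0)
    phi_te = np.maximum(X_te @ W, 0)
    beta = np.linalg.pinv(phi_tr) @ y_tr        # minimum-norm least-squares fit
    mse = np.mean((phi_te @ beta - y_te) ** 2)
    print(f"{n_features:5d} features: test MSE {mse:10.2f}")
```

Nothing in this sketch is specific to neural networks: on this picture, the second descent comes from the overparameterised fit being biased towards low-norm (simple) solutions.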