I agree it’s not vacuous. It sounds like you’re mostly stating the same argument I gave but in different words. Can you tell me what’s wrong or missing from my summary of the argument?
Since it is possible to compress high-probability events using an optimal code for the probability distribution, you might expect that functions with high probability in the neural network prior can be compressed more than functions with low probability; in other words, the more likely functions correspond to shorter programs. Since shorter programs are necessarily more likely in the prior that simulates all possible programs, they should be expected to be better programs, and so to generalize well.
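(To make the first step of that summary concrete, here is a minimal Python sketch, entirely my own illustration rather than anything from the papers under discussion. The prior over four named functions is hypothetical; the only point is that an optimal code assigns an event of probability p a codeword of roughly −log2(p) bits, so high-probability functions admit short descriptions.)

```python
# Minimal sketch (my own toy illustration, not from this thread or the papers it
# discusses): under an optimal code for a distribution, an event of probability p
# gets a codeword of roughly -log2(p) bits, so functions that are likely under
# the prior admit short descriptions.
import math

# Hypothetical prior over four functions, standing in for the NN prior over
# functions; the names and numbers are made up for illustration.
prior = {"f_simple": 0.70, "f_medium": 0.20, "f_complex": 0.08, "f_weird": 0.02}

for f, p in prior.items():
    code_length_bits = -math.log2(p)  # Shannon code length, up to rounding
    print(f"{f:10s}  P(f) = {p:.2f}  ->  description length ~ {code_length_bits:.1f} bits")
```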
(Even if you are talking about the overparameterized case, where this argument is not vacuous and also applies primarily to neural nets and not other ML models, I don’t find this argument very compelling a priori, though I agree that based on empirical evidence it is probably mostly correct.)
I agree with your summary. I’m mainly just clarifying my view of the strength and overall role of the Algorithmic Information Theory (AIT) arguments, since you said you found them unconvincing.
I do however disagree that those arguments can be applied to “literally any machine learning algorithm”, although they certainly do apply to a much larger class of ML algorithms than just neural networks. I also don’t think this is necessarily a bad thing. The picture the AIT arguments give makes it reasonably unsurprising that you get the double-descent phenomenon as you increase the size of a model: at small sizes VC-dimension-style mechanisms dominate, but at larger sizes the overparameterisation starts to induce a simplicity bias, which eventually takes over. Since you get double descent in the model size for both neural networks and e.g. random forests, you should expect there to be some mechanism in common between them (even if the details of course differ from case to case).
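(For what it’s worth, here is the kind of model-size sweep I have in mind, as a hedged sketch rather than anything from this thread: random-features regression is a different and simpler model family than the neural networks and random forests mentioned above, but it is a standard setting where the same size-wise double descent is easy to reproduce. All data and parameter choices below are made up for illustration, and the test error typically, though not invariably, peaks near the interpolation threshold before falling again.)

```python
# A hedged toy sketch (my illustration; a simpler model family than the NNs and
# random forests discussed above): sweep model size for minimum-norm least
# squares on random ReLU features. Test error typically peaks near the
# interpolation threshold (n_features ~ n_train) and drops again beyond it.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 40, 500

def make_data(n):
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(4 * x[:, 0]) + 0.1 * rng.normal(size=n)  # noisy 1-D target
    return x, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

for n_feat in [5, 10, 20, 40, 80, 200, 1000]:
    W = rng.normal(size=(1, n_feat))          # fixed random first layer
    b = rng.uniform(-1, 1, size=n_feat)
    phi_tr = np.maximum(X_tr @ W + b, 0)      # random ReLU features, train
    phi_te = np.maximum(X_te @ W + b, 0)      # random ReLU features, test
    # np.linalg.pinv gives the least-squares fit when underparameterised and
    # the minimum-norm interpolating fit when overparameterised.
    w = np.linalg.pinv(phi_tr) @ y_tr
    test_mse = np.mean((phi_te @ w - y_te) ** 2)
    print(f"{n_feat:5d} features -> test MSE {test_mse:.3f}")
```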