I wonder if this is a neural network thing, an SGD thing, or a both thing? I would love to see what happens when you swap out SGD for something like HMC, NUTS, or ATMC if we're resource-constrained. If we still see the same effects, then that tells us this is because of the distribution of functions that neural networks represent, since we're effectively drawing samples from an approximation to the posterior. Otherwise, it would mean that SGD plays a role.
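For concreteness, here is roughly the kind of experiment I have in mind: take the same small network, fit it once with SGD and once by sampling its posterior with NUTS, and compare the behaviour. A minimal sketch using NumPyro and JAX (the width, prior scales, noise level, and toy data are arbitrary choices of mine, just to show the shape of the setup):

```python
# Sample the posterior over a tiny MLP's weights with NUTS instead of
# optimising them with SGD. All hyperparameters here are illustrative.
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS, Predictive


def bnn(X, y=None, hidden=16):
    n_in = X.shape[1]
    # Independent standard-normal priors over all weights and biases.
    w1 = numpyro.sample("w1", dist.Normal(0.0, 1.0).expand([n_in, hidden]))
    b1 = numpyro.sample("b1", dist.Normal(0.0, 1.0).expand([hidden]))
    w2 = numpyro.sample("w2", dist.Normal(0.0, 1.0).expand([hidden, 1]))
    b2 = numpyro.sample("b2", dist.Normal(0.0, 1.0).expand([1]))
    h = jnp.maximum(X @ w1 + b1, 0.0)          # ReLU hidden layer
    mean = (h @ w2 + b2).squeeze(-1)
    numpyro.sample("obs", dist.Normal(mean, 0.1), obs=y)


# Toy 1-D regression data.
X = jnp.linspace(-1.0, 1.0, 30).reshape(-1, 1)
y = jnp.sin(3.0 * X).ravel() + 0.1 * random.normal(random.PRNGKey(0), (30,))

mcmc = MCMC(NUTS(bnn), num_warmup=500, num_samples=500)
mcmc.run(random.PRNGKey(1), X, y)

# Each MCMC draw is one network sampled from (an approximation to) the
# posterior over functions; average them for a posterior-predictive mean.
preds = Predictive(bnn, mcmc.get_samples())(random.PRNGKey(2), X)["obs"]
print(preds.mean(axis=0))
```

If the same curves show up when you scale something like this up, that points at the network's prior over functions rather than at SGD.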
what exactly are the magical inductive biases of modern ML that make interpolation work so well?
Are you aware of this work and the papers they cite?
From the abstract:
We prove that the binary classifiers of bit strings generated by random wide deep neural networks with ReLU activation function are biased towards simple functions. The simplicity is captured by the following two properties. For any given input bit string, the average Hamming distance of the closest input bit string with a different classification is at least $\sqrt{n/(2\pi \log n)}$, where n is the length of the string. Moreover, if the bits of the initial string are flipped randomly, the average number of flips required to change the classification grows linearly with n. These results are confirmed by numerical experiments on deep neural networks with two hidden layers, and settle the conjecture stating that random deep neural networks are biased towards simple functions. This conjecture was proposed and numerically explored in [Valle Pérez et al., ICLR 2019] to explain the unreasonably good generalization properties of deep learning algorithms. The probability distribution of the functions generated by random deep neural networks is a good choice for the prior probability distribution in the PAC-Bayesian generalization bounds. Our results constitute a fundamental step forward in the characterization of this distribution, therefore contributing to the understanding of the generalization properties of deep learning algorithms.
I would float the hypothesis that large volumes of neural-network parameter space are devoted to functions that are similar to functions with low K-complexity, and small volumes are devoted to functions that are similar to high-K-complexity functions, leading to a Solomonoff-like prior over functions.
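The kind of experiment behind this (and behind the numerical part of the paper above) is cheap to sketch: sample random two-hidden-layer ReLU nets, record which boolean function of the input bits each one computes, and compare how often a function shows up against some crude complexity proxy. Everything below (widths, weight scales, sample counts, the zlib-compression proxy) is my own arbitrary choice, not the paper's setup:

```python
# Sample random two-hidden-layer ReLU nets on n-bit inputs and tally which
# boolean functions they compute.
import collections
import itertools
import zlib

import numpy as np

n_bits = 7
X = np.array(list(itertools.product([0, 1], repeat=n_bits)), dtype=float)  # all 2^7 inputs

def random_relu_net_truth_table(rng, hidden=40):
    """Return the truth table of a freshly sampled 2-hidden-layer ReLU net."""
    w1 = rng.normal(0.0, 1.0 / np.sqrt(n_bits), (n_bits, hidden))
    w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, hidden))
    w3 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 1))
    h1 = np.maximum(X @ w1, 0.0)
    h2 = np.maximum(h1 @ w2, 0.0)
    return ((h2 @ w3).ravel() > 0.0).astype(np.uint8)

def approx_complexity(table):
    """Crude stand-in for K-complexity: compressed length of the truth table."""
    return len(zlib.compress(np.packbits(table).tobytes()))

rng = np.random.default_rng(0)
counts = collections.Counter()
tables = {}
for _ in range(20_000):
    table = random_relu_net_truth_table(rng)
    counts[table.tobytes()] += 1
    tables[table.tobytes()] = table

# If the simplicity-bias story is right, the most frequently sampled functions
# should tend to have the smallest compressed truth tables.
for key, freq in counts.most_common(10):
    print(f"freq={freq:5d}  approx_complexity={approx_complexity(tables[key])}")
```

I'd expect constant and near-constant functions to dominate the counts in a run like this, which is at least consistent with the volume story, though it doesn't pin down anything as strong as a Solomonoff prior.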
I wonder if this is a neural network thing, an SGD thing, or a both thing?
Neither, actually—it’s more general than that. Belkin et al. show that it happens even for simple models like decision trees. Also see here for an example with polynomial regression.
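The polynomial-regression version fits in a few lines: fit Legendre features of increasing degree with minimum-norm least squares and watch the test error around the point where the model can first interpolate the training set. This is only a sketch of the kind of experiment the linked example runs; the target function, noise level, and degrees are arbitrary choices of mine:

```python
# Minimum-norm polynomial regression at increasing degree, tracking train
# and test error around the interpolation threshold (degree ~ n_train - 1).
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)

n_train = 15
x_train = rng.uniform(-1, 1, n_train)
y_train = target(x_train) + 0.1 * rng.normal(size=n_train)
x_test = np.linspace(-1, 1, 500)
y_test = target(x_test)

def fit_and_eval(degree):
    """Least-squares fit in a Legendre basis of the given degree."""
    Phi_train = np.polynomial.legendre.legvander(x_train, degree)
    Phi_test = np.polynomial.legendre.legvander(x_test, degree)
    # lstsq returns the minimum-norm solution once the system is
    # underdetermined, i.e. once degree + 1 exceeds n_train.
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    train_mse = np.mean((Phi_train @ coef - y_train) ** 2)
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    return train_mse, test_mse

for degree in [2, 5, 10, 13, 14, 15, 20, 50, 200, 1000]:
    tr, te = fit_and_eval(degree)
    print(f"degree={degree:4d}  train_mse={tr:.2e}  test_mse={te:.2e}")
```

Whether the test error actually comes back down past the spike near degree = n_train − 1 depends on the basis and the noise, but that spike-then-descent shape is exactly the interpolation-threshold behaviour in question.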
Are you aware of this work and the papers they cite?
Yeah, I am. I definitely think that stuff is good, though ideally I want something more than just “approximately K-complexity.”