Yes, as Adam Shimi suggested below, I’m here talking about “input-output relations” when I say “functions”.
The claim can be made formal in a few different ways, but one way is this: let’s say we have a classification task with 10 classes, 50k training examples, and 10k test examples. If we consider the domain to be these 60k instances (and the codomain to be the 10 classes), then we have 10^60k functions defined in this space, 10^10k of which fit the training data perfectly. Of these 10^10k functions, the vast vast majority will have very poor accuracy on the test set. Therefore, if we tried to do inference by picking a random function that fits the training data, we would almost certainly get very poor generalisation (in fact, we would expect to get the same accuracy as with random guessing).
We can sometimes ensure that a learning algorithm will generalise well if it can only express a restricted set of functions (see VC dimensionality). However, a neural network is too expressive for this explanation to work (because they can express all those 10^10k functions). Therefore, there must be some further explanation for why neural networks tend to learn functions that generalise well (ie, why they tend to learn low-complexity functions, when they in principle could learn a high-complexity function instead).
Thanks! Yeah, I just forgot the definition of a function, sorry. I was thinking of computer programs. Simple functions are expressed by more computer programs than complex functions. (Which is why I was already inclined to believe your conclusion before reading your stuff—if neural nets are like computer programs, then it is plausible that simple functions are expressed by more neural net initializations)
Yes, as Adam Shimi suggested below, I’m here talking about “input-output relations” when I say “functions”.
The claim can be made formal in a few different ways, but one way is this: let’s say we have a classification task with 10 classes, 50k training examples, and 10k test examples. If we consider the domain to be these 60k instances (and the codomain to be the 10 classes), then we have 10^60k functions defined in this space, 10^10k of which fit the training data perfectly. Of these 10^10k functions, the vast vast majority will have very poor accuracy on the test set. Therefore, if we tried to do inference by picking a random function that fits the training data, we would almost certainly get very poor generalisation (in fact, we would expect to get the same accuracy as with random guessing).
We can sometimes ensure that a learning algorithm will generalise well if it can only express a restricted set of functions (see VC dimensionality). However, a neural network is too expressive for this explanation to work (because they can express all those 10^10k functions). Therefore, there must be some further explanation for why neural networks tend to learn functions that generalise well (ie, why they tend to learn low-complexity functions, when they in principle could learn a high-complexity function instead).
Thanks! Yeah, I just forgot the definition of a function, sorry. I was thinking of computer programs. Simple functions are expressed by more computer programs than complex functions. (Which is why I was already inclined to believe your conclusion before reading your stuff—if neural nets are like computer programs, then it is plausible that simple functions are expressed by more neural net initializations)