Thanks for this, this is really exciting! I am especially interested to hear pushback against it, because I think it is evidence for short timelines.
Moreover, most functions that fit a given set of training data will not generalise well to new data.
Can you say more about the sense in which this is true? There’s a sense in which it is not true, IIRC: if you use the number line between 0 and 1 to code for functions, such that every possible function is a number somewhere on that line, and you bundle equivalent functions together (i.e. functions that give the same output on all the relevant data), then simpler functions will have more measure. This is how Solomonoff induction works, I think. And insofar as we think Solomonoff induction would in principle generalize well to new data, it seems that there’s a sense in which most functions that fit a given set of training data would generalize well to new data.
Yes, as Adam Shimi suggested below, I’m here talking about “input-output relations” when I say “functions”.
The claim can be made formal in a few different ways, but one way is this: let’s say we have a classification task with 10 classes, 50k training examples, and 10k test examples. If we consider the domain to be these 60k instances (and the codomain to be the 10 classes), then we have 10^(60k) functions defined in this space, 10^(10k) of which fit the training data perfectly. Of these 10^(10k) functions, the vast vast majority will have very poor accuracy on the test set. Therefore, if we tried to do inference by picking a random function that fits the training data, we would almost certainly get very poor generalisation (in fact, we would expect to get the same accuracy as with random guessing).
We can sometimes ensure that a learning algorithm will generalise well if it can only express a restricted set of functions (see VC dimension). However, a neural network is too expressive for this explanation to work (because it can express all those 10^(10k) functions). Therefore, there must be some further explanation for why neural networks tend to learn functions that generalise well (i.e., why they tend to learn low-complexity functions, when they in principle could learn a high-complexity function instead).
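(A minimal sketch of the counting argument above, added for illustration: the class/train/test sizes are scaled down from 10/50k/10k so it runs instantly, and all the variable names are made up for this example. It just picks a random function that fits the training labels and checks its test accuracy.)

```python
import random

# Toy, scaled-down version of the counting argument. A "function" here is
# just an arbitrary labelling of every instance (an input/output table).
num_classes = 10
num_train, num_test = 500, 100

random.seed(0)
true_labels = [random.randrange(num_classes) for _ in range(num_train + num_test)]

# A random function that fits the training data perfectly: it agrees with the
# training labels and assigns arbitrary labels to the test instances.
random_fit = true_labels[:num_train] + [random.randrange(num_classes) for _ in range(num_test)]

test_hits = sum(p == y for p, y in zip(random_fit[num_train:], true_labels[num_train:]))
print(f"Test accuracy of a random training-fitting function: {test_hits / num_test:.2f}")
# Expected accuracy is 1/num_classes = 0.10, i.e. the same as random guessing,
# because such a function is completely unconstrained on the test instances.
```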
Thanks! Yeah, I just forgot the definition of a function, sorry. I was thinking of computer programs. Simple functions are expressed by more computer programs than complex functions. (Which is why I was already inclined to believe your conclusion before reading your stuff: if neural nets are like computer programs, then it is plausible that simple functions are expressed by more neural net initializations.)
The comparison with Solomonoff induction strikes me as wrong, because Solomonoff induction adapts to new data, whereas “most functions that fit a given set of training data will not generalise well to new data” is talking about a function that has already been learned, and is thus fixed, when it is given new data.
As for why most functions will not generalize well, it’s probably because there are far more wrong answers than correct ones for each new datapoint, in general.
SI is Bayesianism + the universal prior, IIRC. The way that Solomonoff induction generalizes to new data is simply that it takes the prior and cuts out all the functions that contradict the new data (i.e. it does a Bayesian update). So insofar as it generalizes well, it’s because the original universal prior contains functions that generalize well, and they have relatively high prior compared to the functions that don’t. And the reason this is true is that the prior is based on complexity, and when you dig into the details you encounter the argument I just sketched, where the prior actually comes from weighting each function equally but showing that simpler functions are more numerous.

It’s true that there are more wrong answers than right answers, but (I claim) there are more functions that continue the sequence by giving the right answer than functions that continue the sequence by giving a wrong answer. (At least, assuming the right answer is simple, or rather the result of simple laws, which we assume it is if we think that Solomonoff induction generalizes well.)
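(Here is a minimal toy sketch of the “cut out whatever contradicts the data and renormalise” picture, for illustration only: the hypotheses are all binary input/output tables on a six-point domain rather than programs, and the “complexity” prior is a crude flip-counting proxy of my own, not anything like a real universal prior.)

```python
from itertools import product

# Toy analogue of the update: hypotheses are all binary functions on a
# 6-point domain (represented as output tuples), with a prior that stands in
# for a complexity-based prior: weight 2^-(number of flips in the outputs).
domain_size = 6
hypotheses = list(product([0, 1], repeat=domain_size))

def flips(h):
    # Crude complexity proxy: how often consecutive outputs differ.
    return sum(h[i] != h[i + 1] for i in range(len(h) - 1))

prior = {h: 2.0 ** -flips(h) for h in hypotheses}
total = sum(prior.values())
prior = {h: w / total for h, w in prior.items()}

# Observe the first four outputs; the update just drops every hypothesis
# that contradicts them and renormalises the rest.
observed = (0, 0, 0, 0)
posterior = {h: w for h, w in prior.items() if h[:len(observed)] == observed}
z = sum(posterior.values())
posterior = {h: w / z for h, w in posterior.items()}

# Prediction for the next point: the simpler continuations dominate.
p_next_zero = sum(w for h, w in posterior.items() if h[len(observed)] == 0)
print(f"P(next output = 0 | observations) = {p_next_zero:.3f}")  # ~0.667
```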
Okay, I think a confusion here is that I (and the OP, AFAIK) don’t mean the same thing as you do by the word “function”. Solomonoff induction is about programs (usually Turing machines), and from your comment that seems to be the sense of “function” you have in mind. But a function, as I’m using the word here (and I’m pretty sure this is the meaning in the quote), is just an input/output relation. So it doesn’t make sense to say that two functions are equivalent (in the way you’re talking about, at least), because they would necessarily differ on some input (or they would be the same function). On the other hand, two programs can be equivalent in that they produce the same outputs (they compute the same function).
So from the input/output standpoint, if functions are coded as points on a line, the correct function is a single point. In that sense there are indeed far more functions that generalize poorly than ones that generalize well (this becomes a bit more complicated when you consider how well your function generalizes, but you still generally have many more ways to be wrong on each data point than the single way to be right).
I think this clarifies the quote: this part of the post makes the fairly standard argument that neural networks generalize pretty well, and are not really limited in terms of which input/output relations they can express, yet most input/output relations will not generalize correctly to new data. So there must be something more going on.

Does that make sense to you?
Ahhhh, yes, thank you, that hits the nail on the head!
So I guess my original question has been answered, and I’m left with a new question about whether the analogy to Solomonoff induction might be useful here: simpler functions get more weight in the universal prior because there are more programs that compute them; perhaps simpler functions get more weight in a neural network’s implicit prior because there are more parameter-settings that compute them (i.e. a bigger region of parameter space), and maybe both of these facts are true for the same reason.
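(One cheap way to poke at the “bigger region of parameter-space” guess, purely as an illustrative sketch with architecture and sampling choices that are arbitrary assumptions on my part: sample random initialisations of a tiny 3-input network, read off which boolean input/output function each one computes, and see how unevenly the 256 possible functions get hit.)

```python
from collections import Counter
from itertools import product

import numpy as np

# Sample random tiny networks (3 inputs, one hidden layer of 8 tanh units,
# Gaussian weights) and record the boolean function each one computes on
# all 2^3 = 8 inputs. All of these architectural choices are arbitrary.
rng = np.random.default_rng(0)
inputs = np.array(list(product([0.0, 1.0], repeat=3)))  # the 8 boolean input points

def sampled_function(hidden=8):
    w1 = rng.normal(size=(3, hidden))
    b1 = rng.normal(size=hidden)
    w2 = rng.normal(size=hidden)
    b2 = rng.normal()
    out = np.tanh(inputs @ w1 + b1) @ w2 + b2
    # Threshold the outputs to get an 8-bit string naming the boolean function.
    return "".join(str(int(v > 0)) for v in out)

counts = Counter(sampled_function() for _ in range(20_000))
print("Most frequently sampled functions (out of 256 possible):")
for fn, n in counts.most_common(5):
    print(f"  {fn}: {n}")
# In runs like this, a handful of low-complexity functions (often the two
# constant ones) tend to take up a disproportionate share of the samples.
```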