The comparison with Solomonoff induction strikes me as wrong, because Solomonoff induction adapts to new data, whereas “most functions that fit a given set of training data will not generalise well to new data” generally means that the function is learned once and thus fixed when given new data.
As for why most functions will not generalize well, it’s probably because, in general, there are far more wrong answers than correct ones for each new datapoint.
SI is Bayesianism + the universal prior, IIRC. The way that Solomonoff induction generalizes to new data is simply that it takes the prior and cuts out all the functions that contradict the new data (i.e. it does a Bayesian update). So insofar as it generalizes well, it’s because the original universal prior contains functions that generalize well, and they have relatively high prior compared to the functions that don’t. And the reason this is true is that the prior is based on complexity, and when you dig into the details you encounter the argument I just sketched, where the prior actually comes from weighting each function equally but proving that simpler functions are more numerous.
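To spell out the update I mean (this is just the standard textbook form of the universal prior and its conditioning, nothing beyond what I said above):

```latex
\begin{align*}
% Universal prior: the weight of a string x is the total weight 2^{-|p|}
% of every program p whose output on the universal machine U starts with x.
M(x) &= \sum_{p \,:\, U(p) = x\ast} 2^{-|p|} \\[4pt]
% Prediction is just Bayesian conditioning on the data seen so far: programs
% whose output contradicts x_{1:t} drop out, and the rest keep weight
% proportional to 2^{-|p|}.
M(x_{t+1} \mid x_{1:t}) &= \frac{M(x_{1:t}\, x_{t+1})}{M(x_{1:t})}
\end{align*}
```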
It’s true that there are more wrong answers than right answers, but (I claim) there are more functions that continue the sequence by giving the right answer than functions that continue the sequence by giving a wrong answer. (At least, assuming the right answer is simple, or rather the result of simple laws, which we assume it is if we think that Solomonoff induction generalizes well.)
Okay, I think a confusion here is that I (and the OP, AFAIK) am not talking about the same thing as you are when using the word “function”. Solomonoff induction is about programs (usually Turing machines), and from your comment that seems to be the sense of “function” you’re using. But a function, as I’m using the word here (and I’m pretty sure this is the meaning in the quote), is just an input/output relation. So it doesn’t make sense to say that two functions are equivalent (in the way you’re talking about, at least), because they necessarily differ on some input (or they would be the same function). On the other hand, two programs can be equivalent in that they output the same things (they compute the same function).
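Before going on, here’s a toy enumeration to make that program/function distinction concrete (the straight-line program format and the AND/OR/XOR gate set are arbitrary choices of mine, not anything from the post):

```python
# Toy illustration: many distinct "programs" compute the same input/output
# relation. A program here is a straight-line sequence of gate applications
# over two boolean inputs; each step may read the inputs or any earlier result.
from itertools import product
from collections import Counter

GATES = {
    "AND": lambda x, y: x & y,
    "OR":  lambda x, y: x | y,
    "XOR": lambda x, y: x ^ y,
}

def run(program, a, b):
    # Execute the straight-line program; the output is the last wire computed.
    wires = [a, b]
    for gate, i, j in program:
        wires.append(GATES[gate](wires[i], wires[j]))
    return wires[-1]

def truth_table(program):
    # The function a program computes is just its input/output relation.
    return tuple(run(program, a, b) for a, b in product([0, 1], repeat=2))

# Enumerate every 2-step program: step 1 can read wires 0-1, step 2 wires 0-2.
step1 = [(g, i, j) for g in GATES for i in range(2) for j in range(2)]
step2 = [(g, i, j) for g in GATES for i in range(3) for j in range(3)]
programs = [(s1, s2) for s1 in step1 for s2 in step2]

counts = Counter(truth_table(p) for p in programs)
print(len(programs), "programs compute", len(counts), "distinct functions")
for table, n in counts.most_common():
    print(table, n)
```

In this particular toy setup, a few hundred programs collapse onto just eight input/output relations, and the relations at the top of the tally are the simplest ones (the constant-0 function and the two projections), while the ones that genuinely need both gates sit at the bottom — the same degeneracy your complexity argument leans on.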
So from the input/output standpoint, if you picture all possible functions laid out on a line, the correct function is a single point. In that sense there are indeed far more functions that generalize poorly than ones that generalize well (this becomes a bit more complicated once you ask how well a function generalizes rather than just whether it does, but you still generally have far more ways to be wrong on each data point than the single way to be right).
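To put a (very) toy number on that picture (the domain and label sizes here are arbitrary choices of mine): enumerate every function on a small finite input space, keep the ones that fit the training points, and check how many also get the new point right.

```python
# With a finite input space and no further restriction, almost every function
# that fits the training data is wrong on a new input.
from itertools import product

INPUTS = range(4)          # tiny domain so we can enumerate every function
OUTPUTS = range(5)         # 5 possible labels per input
target = {x: (2 * x) % 5 for x in INPUTS}   # the "true" function (arbitrary)
train_inputs = [0, 1, 2]   # points seen during training
test_input = 3             # the new data point

# Every function is just an assignment of an output to each input.
all_functions = [dict(zip(INPUTS, outs))
                 for outs in product(OUTPUTS, repeat=len(INPUTS))]

fit_train = [f for f in all_functions
             if all(f[x] == target[x] for x in train_inputs)]
generalise = [f for f in fit_train if f[test_input] == target[test_input]]

print(len(fit_train))      # 5  -> one per possible answer on the new input
print(len(generalise))     # 1  -> only one of them is right
```

Only one of the survivors generalizes, and making the output space bigger only makes that ratio worse.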
I think this helps clarify the quote, since this part of the post lays out the relatively classic argument: neural networks generalize pretty well and are not really limited in terms of which input/output relations they can express, yet most input/output relations will not generalize correctly to new data. So there must be something more going on here.
Does that make sense to you?
Ahhhh, yes, thank you, that hits the nail on the head!
So I guess my original question has been answered, and I’m left with a new question about whether the analogy to Solomonoff induction might be useful here: simpler functions get more weight in the universal prior because there are more programs that compute them; perhaps simpler functions get more weight in a neural network’s implicit prior because there are more parameter settings that compute them (i.e. a bigger region of parameter space), and maybe both of these facts are true for the same reason.
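A crude way to poke at that conjecture (this is my own toy setup — a 2-2-1 threshold network with Gaussian weights, nothing canonical): sample random parameter settings for a tiny network and tally which boolean function of the two inputs each draw implements. If the parameter-function map were unbiased, the 16 possible functions would come up roughly equally often; the conjecture is that the tally is instead heavily skewed toward the simple ones.

```python
# Tally which input/output relation a tiny random network computes, as a toy
# probe of the "bigger region of parameter-space" idea.
import random
from collections import Counter
from itertools import product

def step(x):
    return 1 if x > 0 else 0

def sample_network():
    # 2 inputs -> 2 hidden threshold units -> 1 output threshold unit,
    # all weights and biases drawn i.i.d. from a standard Gaussian.
    return [random.gauss(0, 1) for _ in range(9)]

def evaluate(params, a, b):
    w1a, w1b, b1, w2a, w2b, b2, v1, v2, v3 = params
    h1 = step(w1a * a + w1b * b + b1)
    h2 = step(w2a * a + w2b * b + b2)
    return step(v1 * h1 + v2 * h2 + v3)

def function_of(params):
    # The input/output relation the sampled parameters implement.
    return tuple(evaluate(params, a, b) for a, b in product([0, 1], repeat=2))

random.seed(0)
tally = Counter(function_of(sample_network()) for _ in range(100_000))
for table, n in tally.most_common():
    print(table, n)
```

The counts come out very uneven rather than uniform across the 16 possible functions, which is at least consistent with the parameter-volume version of the story.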