Your point about neural nets NEVER being able to extrapolate is wrong. NNs are universal function approximators. A sufficiently large NN with the right weights can therefore approximate the “extrapolation” function (or even approximate whatever extrapolative model you’re training in place of an NN). I usually wouldn’t bring up this sort of objection since actually learning the right weights is not guaranteed to be feasible with any algorithm and is pretty much assuming away the entire problem, but you said “An algorithm that can’t extrapolate is an algorithm that can’t extrapolate. Period.”, so I think it’s warranted.
Also, I’m pretty sure “extrapolation” is essentially Bayesian:
start with a good prior over the sorts of data distributions you’ll likely encounter
update that prior with your observed data
generate future data from the resulting posterior
There’s nothing in there that NNs are fundamentally incapable of doing.
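The three steps above can be sketched concretely. This is a toy illustration of my own (not from the post), using a Beta-Bernoulli model to "extrapolate" a coin's bias from a few observed flips:

```python
# Hedged sketch of the three steps above with a Beta-Bernoulli model:
# guess the next coin flip from a handful of observed flips.

# 1. Prior over the sorts of data you'll encounter: Beta(alpha, beta)
#    over the coin's heads-probability.
alpha, beta = 2.0, 2.0  # mild prior belief that the coin is roughly fair

# 2. Update the prior with observed data (1 = heads, 0 = tails).
observed = [1, 1, 0, 1, 1]
alpha += sum(observed)
beta += len(observed) - sum(observed)

# 3. Generate future data from the posterior: the posterior predictive
#    probability of heads on the next flip is the posterior mean.
p_next_heads = alpha / (alpha + beta)
print(round(p_next_heads, 3))  # → 0.667
```

Nothing here is exotic; the question is only whether an NN can represent a good prior and the update, not whether the recipe itself is well-defined.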
Finally, your reference to neural ODEs would be more convincing if you’d shown them reaching state of the art results in benchmarks after 2019. There are plenty of methods that do better than NNs when there’s limited data. The reason NNs remain so dominant is that they keep delivering better results as we throw more and more data at them.
Your point about neural nets NEVER being able to extrapolate is wrong. NNs are universal function approximators. A sufficiently large NN with the right weights can therefore approximate the “extrapolation” function (or even approximate whatever extrapolative model you’re training in place of an NN).
I’m pretty sure this is wrong. The universal approximator theorems I’ve seen work by interpolating the function they are fitting; this doesn’t guarantee that they can extrapolate.
In fact it seems to me that a universal approximator theorem can never show that a network is capable of extrapolating? Because the universal approximation has to hold for all possible functions, while extrapolation inherently involves guessing a specific function.
There’s nothing in there that NNs are fundamentally incapable of doing.
The post isn’t talking about fundamental capabilities, it is talking about “today’s methods”.
I was thinking of the NN approximating the “extrapolate” function itself, that thing which takes in partial data and generates an extrapolation. That function is, by assumption, capable of extrapolation from incomplete data. Therefore, I expect a sufficiently precise approximation to that function is able to extrapolate.
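A minimal, hypothetical stand-in for that “extrapolate” function (my illustration, not anything from the post): it takes partial data and returns a guess at the continuation, here by fitting a least-squares line. The point is only that “extrapolate” is itself an ordinary input-to-output mapping, so a sufficiently precise approximation to it would also extrapolate:

```python
def extrapolate(partial, n_future=3):
    """Least-squares line through the observed points, extended forward."""
    n = len(partial)
    xs = list(range(n))
    mean_x = sum(xs) / n
    mean_y = sum(partial) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, partial)) \
        / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    # Guess at the next n_future points of the sequence.
    return [intercept + slope * (n + k) for k in range(n_future)]

print(extrapolate([2, 4, 6, 8]))  # → [10.0, 12.0, 14.0]
```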
It may be helpful to argue from Turing completeness instead. Transformers are Turing complete. If “extrapolation” is Turing computable, then there’s a transformer that implements extrapolation.
Also, the post is talking about what NNs are ever able to do, not just what they can do now. That’s why I thought it was appropriate to bring up theoretical computer science.
I’m implying that the popular ANNs in use today have bad priors. Are you implying that a sufficiently large ANN has good priors or that it can learn good priors?
That they can learn good priors. That’s pretty much what I think happens with pretraining. Learn the prior distribution of data in a domain, then you can adapt that knowledge to many downstream tasks.
Also, I don’t think built-in priors are that helpful. CNNs have a strong locality prior, while transformers don’t. You’d think that would make CNNs much better at image processing. After all, a transformer’s prior is that the pixel at (0,0) is just as likely to relate to the pixel at (512, 512) as it is to relate to its neighboring pixel at (0,1). However, experiments have shown that transformers are competitive with state of the art, highly tuned CNNs. (here, here)
ML prof here.
Universal fn approx assumes bounded domain, which basically means it is about interpolation.