By universal approximation, these features will be sufficient for any downstream learning task
Right, but trying to fit an unknown function with linear combinations of those features might be extremely data-inefficient, such that it is basically unusable for difficult tasks. Of course you could do better if you’re not restricted to linear combinations—for instance, if the map is injective you could invert back to the original space and apply whatever algorithm you wanted. But at that point you’re not really using the Fourier features at all. In particular, the NTK always learns a linear combination of its features, so it’s the efficiency of linear combinations that’s relevant here.
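(To make 'fitting with linear combinations of fixed features' concrete, here's a minimal sketch in Python using random Fourier features: the feature map is frozen at initialization and the only thing fitted is a linear readout over it, via ridge regression. The feature count, frequency scale, and toy target are illustrative assumptions, not anything specific to the NTK.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen random Fourier feature map: frequencies and phases are drawn once
# and never trained (illustrative choices, not NTK-specific).
D = 200
W = rng.normal(scale=1.0, size=(D, 1))   # random frequencies
b = rng.uniform(0, 2 * np.pi, size=D)    # random phases

def phi(x):
    # x: (n, 1) -> (n, D) matrix of fixed features
    return np.sqrt(2.0 / D) * np.cos(x @ W.T + b)

# Toy stand-in for the "unknown function", seen only through noisy samples.
n = 50
x_train = rng.uniform(0, 2 * np.pi, size=(n, 1))
y_train = np.sin(3 * x_train).ravel() + 0.1 * rng.normal(size=n)

# The only fitting is a linear readout over the frozen features
# (ridge regression on phi(x)); the features themselves never change.
Phi = phi(x_train)
lam = 1e-3
coef = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y_train)

x_test = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y_pred = phi(x_test) @ coef
print("train MSE:", float(np.mean((Phi @ coef - y_train) ** 2)))
```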
I agree that there is no learning taking place and that such a method may be inefficient. However, that goes beyond my original objection.
You originally said that the NTK doesn’t learn features because its feature class already has a good representation at initialization. What I was trying to convey (rather unclearly, admittedly) in response is:
A) There exist learning algorithms that have universal-approximating embeddings at initialization yet still learn features. Taking P to be ‘has a universal-approximating embedding at initialization’ and Q to be ‘does not learn features’, such an example gives P without Q, so P → Q cannot hold in general, and I don’t think it’s right to say that the NTK’s lack of feature learning is due to its universal-approximating property.
B) Although the NTK’s representation may be capable of approximating arbitrary functions, it will probably be very slow at learning some of them, perhaps so slow that using it is infeasible. So I would dispute that it already has ‘good’ representations. While it’s universal in one sense, there might be some other sense of ‘universal efficiency’ in which it’s lacking, and where feature-learning algorithms can outperform it (see the toy sketch below).
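As a toy illustration of the distinction in B), here's a rough sketch assuming an arbitrary two-layer ReLU network and target: the 'frozen features' run trains only the linear readout, standing in for a fixed-feature regime, while the 'learned features' run also trains the first layer. It's a caricature rather than a faithful NTK construction; the architecture, width, and training hyperparameters are all assumptions.

```python
import math
import torch

torch.manual_seed(0)

# Toy data: samples of an "unknown" target function.
x = torch.rand(200, 1) * 2 * math.pi
y = torch.sin(3 * x)

def make_net():
    # Assumed architecture: one hidden ReLU layer acting as the "features".
    return torch.nn.Sequential(
        torch.nn.Linear(1, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 1),
    )

def train(net, learn_features, steps=2000, lr=1e-2):
    if not learn_features:
        # Fixed-feature regime: freeze the first layer, so only a linear
        # combination of the initial features can be learned.
        for p in net[0].parameters():
            p.requires_grad_(False)
    opt = torch.optim.Adam(
        [p for p in net.parameters() if p.requires_grad], lr=lr
    )
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.mean((net(x) - y) ** 2)
        loss.backward()
        opt.step()
    return loss.item()

print("frozen features :", train(make_net(), learn_features=False))
print("learned features:", train(make_net(), learn_features=True))
```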
This is not a trivial question
I agree that in practice there’s likely to be some relationship between universal approximation and efficiency; I just think it’s worth distinguishing them conceptually. Thanks for the paper link BTW, it looks interesting.