NTK doesn’t learn features because the feature class at initialization is a universal class
I’ve never heard of any result suggesting this. What’s your argument? I suspect the opposite: by the central limit theorem, the partial derivatives and activations at each layer tend toward samples from a fixed distribution (differing per layer but fixed across neurons). I think this means that the NTK embedding is ‘essentially finite’ and actually not universal (though I’m not sure). Note that to show universality it’s not enough to show that all embeddings can be found; you’d also need an argument showing that their density in the NTK embedding is bounded away from zero.
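To make the concentration point concrete, here’s a minimal numerical sketch (entirely my own; the two-layer ReLU net, the widths, and the helper `empirical_ntk` are illustrative choices, not anything from the papers under discussion). It computes the empirical NTK entry Θ(x, x′) = ∇θ f(x) · ∇θ f(x′) for a width-m net in the standard NTK parameterization and checks that its fluctuations across random initializations shrink as m grows, which is the sense in which the embedding stops depending on the particular draw of weights:

```python
# Sketch: the empirical NTK of a two-layer ReLU net,
# f(x) = (1/sqrt(m)) * sum_j v_j * relu(w_j . x),
# concentrates around a fixed value as the width m grows.
import numpy as np

def empirical_ntk(x1, x2, m, rng):
    """Theta(x1, x2) = grad_theta f(x1) . grad_theta f(x2) at a random initialization."""
    d = x1.shape[0]
    W = rng.standard_normal((m, d))    # first-layer weights w_j ~ N(0, I)
    v = rng.standard_normal(m)         # second-layer weights v_j ~ N(0, 1)
    a1, a2 = W @ x1, W @ x2            # pre-activations
    h1, h2 = np.maximum(a1, 0), np.maximum(a2, 0)
    # df/dv_j = relu(w_j . x) / sqrt(m)
    # df/dw_j = v_j * 1[w_j . x > 0] * x / sqrt(m)
    term_v = h1 @ h2 / m
    term_w = (v**2 * (a1 > 0) * (a2 > 0)).sum() * (x1 @ x2) / m
    return term_v + term_w

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(10), rng.standard_normal(10)
for m in [100, 1_000, 10_000, 100_000]:
    vals = [empirical_ntk(x1, x2, m, rng) for _ in range(5)]
    print(m, round(np.mean(vals), 4), round(np.std(vals), 4))  # std shrinks with m
```

Of course this only shows that the kernel becomes deterministic in the wide limit; it doesn’t by itself settle the universality question either way.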
There’s a big difference between ‘universal learner’ and ‘fits any smooth function on a fixed input space’. The ‘universal learner’ property is about data efficiency: do you have bounded regret compared to any learning algorithm in some wide class? Solomonoff induction has this property with respect to computable predictors on binary strings, for instance. There are lots of learning algorithms able to fit any finite binary sequence but which are not universal. I haven’t seen a good formalism for this in the neural net case, but I think it would involve letting the input dimension increase with the number of data points, and comparing the asymptotic performance of various algorithms.
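To be concrete about the kind of guarantee I mean: the Solomonoff prior M dominates every computable measure μ, with M(x_{1:n}) ≥ 2^{−K(μ)} μ(x_{1:n}) (modulo the usual caveats about semimeasures), so its cumulative log-loss satisfies

$$-\log_2 M(x_{1:n}) \;\le\; -\log_2 \mu(x_{1:n}) + K(\mu),$$

i.e. it is within an additive constant of any computable predictor, where the constant depends only on the competitor’s complexity and not on n. That constant-regret-against-a-whole-class property is what I mean by ‘universal learner’, and it’s much stronger than being able to fit any particular sequence.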
Ah, rereading your original comment more carefully I see that you indeed didn’t say anything about ‘universal learning’. You’re quite right that the NTK is a universal function approximator. My apologies.
However, I still disagree that this is the reason the NTK doesn’t learn features. I think that ‘universal function approximation’ and ‘feature learning’ are basically unrelated dimensions along which a learning algorithm can vary. That is, it’s quite possible to imagine a learning algorithm which constructs a sequence of different embeddings, all of which are universal approximators. The paper by Greg Yang I linked gives an example of such an algorithm (I don’t think he explicitly proves this, but I’m pretty sure it’s true).
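A toy version of what I have in mind (my own sketch, not the construction from Yang’s paper; the helpers `rff_embed` and `ridge_fit_predict`, the bandwidths, and the sine target are all made up for illustration): random Fourier features approximating a Gaussian RBF kernel give, in the infinite-feature limit, a universal embedding for every fixed bandwidth. So an algorithm that adapts the bandwidth to the data moves through a sequence of embeddings, each individually universal, while still doing something you could reasonably call feature learning, and the choice visibly matters:

```python
# Sketch: several random-Fourier-feature embeddings, each approximating a universal
# RBF kernel, but with very different effectiveness for a fixed target and sample size.
import numpy as np

def rff_embed(X, bandwidth, n_features=500, seed=0):
    """Random Fourier features approximating an RBF kernel with the given bandwidth."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_features)) / bandwidth
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def ridge_fit_predict(Phi_tr, y_tr, Phi_te, lam=1e-3):
    """Linear (ridge) regression on top of a fixed embedding."""
    A = Phi_tr.T @ Phi_tr + lam * np.eye(Phi_tr.shape[1])
    return Phi_te @ np.linalg.solve(A, Phi_tr.T @ y_tr)

rng = np.random.default_rng(1)
X_tr, X_te = rng.uniform(-1, 1, (200, 1)), rng.uniform(-1, 1, (200, 1))
target = lambda X: np.sin(12 * X[:, 0])      # moderately high-frequency target
y_tr, y_te = target(X_tr), target(X_te)

for bw in [5.0, 1.0, 0.2]:                   # a 'sequence of embeddings', each universal
    pred = ridge_fit_predict(rff_embed(X_tr, bw), y_tr, rff_embed(X_te, bw))
    print(f"bandwidth {bw}: test MSE {np.mean((pred - y_te) ** 2):.4f}")
```

For me the narrow-bandwidth embedding fits this target far better than the wide one, even though all three come from the same ‘universal’ family; tuning the bandwidth is a (very crude) form of feature learning.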
What I was trying to get at with the ‘universal learning’ remarks is that, although the NTK does indeed contain all finite embeddings, I believe it does not do so in a very efficient way: it might require disproportionately many training points to pick out what are, intuitively, fairly simple embeddings. I believe this is what is behind the poor performance of empirical NTKs compared to SGD-trained nets, as I brought up in this comment, and it ultimately explains why algorithms that do ‘feature learning’ can outperform those that don’t: the feature-learning algorithms are able to find more efficient embeddings for a given set of inputs. (Of course, it’s possible to imagine a fixed embedding that’s ‘optimally efficient’ in some way, but as far as I’m aware the NTK has no such property.) This issue of ‘embedding efficiency’ seems only loosely related to the universal approximation property.

To formalize this, it would be nice to develop a theory of universal inference for classification problems akin to Solomonoff induction. To model this effectively in an asymptotic theory, I think it might be necessary to increase the dimension of the model input along with the number of data points, since otherwise all universal approximators for a given dimension will have asymptotically the same performance. Everything in this paragraph is just my personal speculation, though; as far as I’m aware there’s no existing theory of universal inference in classification problems, so if you found my remarks confusing that’s pretty understandable :)
By universal approximation, these features will be sufficient for any downstream learning task
Right, but trying to fit an unknown function with linear combinations of those features might be extremely data-inefficient, such that it is basically unusable for difficult tasks. Of course you could do better if you’re not restricted to linear combinations—for instance, if the map is injective you could invert back to the original space and apply whatever algorithm you wanted. But at that point you’re not really using the Fourier features at all. In particular, the NTK always learns a linear combination of its features, so it’s the efficiency of linear combinations that’s relevant here.
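As a crude illustration (again entirely my own toy; the target frequency, the feature bank, the grid, and the sample sizes are made up): suppose the truth is y = sin(ωx) for an unknown ω. A linear fit over a large bank of fixed Fourier features has to disentangle ω from everything else in the bank using only the training points, whereas a ‘feature learning’ baseline that first estimates ω and then fits a linear model on that single learned feature can get away with far less data:

```python
# Sketch: linear regression on a large bank of *fixed* Fourier features vs. a crude
# "feature learning" baseline that picks the single best frequency by grid search.
import numpy as np

rng = np.random.default_rng(0)
omega_true = 23.0                          # unknown frequency of the target
n_train, n_test = 20, 500                  # deliberately few training points
x_tr, x_te = rng.uniform(-1, 1, n_train), rng.uniform(-1, 1, n_test)
y_tr, y_te = np.sin(omega_true * x_tr), np.sin(omega_true * x_te)

# (1) Fixed embedding: sin/cos features at 40 preset frequencies (80 features).
freqs = np.arange(1, 41)
bank = lambda x: np.concatenate([np.sin(np.outer(x, freqs)),
                                 np.cos(np.outer(x, freqs))], axis=1)
Phi_tr, Phi_te = bank(x_tr), bank(x_te)
w = np.linalg.solve(Phi_tr.T @ Phi_tr + 1e-3 * np.eye(Phi_tr.shape[1]), Phi_tr.T @ y_tr)
mse_fixed = np.mean((Phi_te @ w - y_te) ** 2)

# (2) Crude "feature learning": grid-search the single frequency that best explains
# the training data, then use it as the one learned feature.
def fit_one_freq(omega):
    F = np.stack([np.sin(omega * x_tr), np.cos(omega * x_tr)], axis=1)
    c, *_ = np.linalg.lstsq(F, y_tr, rcond=None)
    return c, np.mean((F @ c - y_tr) ** 2)

best = min(np.linspace(1, 40, 2000), key=lambda om: fit_one_freq(om)[1])
c, _ = fit_one_freq(best)
F_te = np.stack([np.sin(best * x_te), np.cos(best * x_te)], axis=1)
mse_learned = np.mean((F_te @ c - y_te) ** 2)

print(f"fixed feature bank: test MSE {mse_fixed:.3f}")
print(f"learned frequency : test MSE {mse_learned:.3f} (omega_hat = {best:.2f})")
```

With so few samples relative to the bank’s highest frequency, I’d expect the fixed-bank fit to interpolate the training points but generalize much worse than the two-parameter ‘learned feature’ fit. It’s only a cartoon, but it’s the kind of efficiency gap I have in mind.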
I agree that there is no learning taking place and that such a method may be inefficient. However, that goes beyond my original objection.
You originally said that the NTK doesn’t learn features because its feature class already has a good representation at initialization. What I was trying to convey (rather unclearly, admittedly) in response is:
A) There exist learning algorithms whose embeddings are universal approximators at initialization and which nonetheless learn features. Writing P for ‘the embedding at initialization is a universal approximator’ and Q for ‘no feature learning occurs’, these algorithms give an example of P and ¬Q, so P → Q cannot hold in general. I therefore don’t think it’s right to say that the NTK’s lack of feature learning is due to its universal-approximation property.
B) Although the NTK’s representation may be capable of approximating arbitrary functions, it will probably be very slow at learning some of them, perhaps so slow that using it is infeasible. So I would dispute that it already has ‘good’ representations. While it’s universal in one sense, there might be some other sense of ‘universal efficiency’ in which it’s lacking, and where feature-learning algorithms can outperform it.
This is not a trivial question
I agree that in practice there’s likely to be some relationship between universal approximation and efficiency, I just think it’s worth distinguishing them conceptually. Thanks for the paper link BTW, it looks interesting.