Hmm, so regarding the linear combinations, it’s true that some linear combinations will change by Θ(1) in the large-width limit. Just use the vector of partial derivatives of the output at some particular input as the weights: the resulting sum will change by the amount that the output function moves during the regression. Indeed, I suspect (but don’t have a proof) that these particular combinations span the space of linear combinations that change non-trivially during training. I would dispute “we expect most linear combinations to change” though: the CLT argument implies that we should expect almost all combinations to not change appreciably. I’m not sure what effect this would have on the PCA, and I still think it’s plausible that it doesn’t change at all (actually, I think Greg Yang states that it doesn’t change in section 9 of his paper, though I haven’t read that part super carefully).
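To spell out the Θ(1) claim, here is a rough first-order sketch. It assumes the linearized (lazy) regime and reads “partial derivatives of the output” as derivatives with respect to the parameters; the notation (θ_0 for the initial parameters, θ_T for the trained ones, Δθ_i for the change in coordinate i) is mine and not taken from any of the papers discussed.

```latex
f(x;\theta_T) - f(x;\theta_0)
  \;\approx\; \nabla_\theta f(x;\theta_0) \cdot (\theta_T - \theta_0)
  \;=\; \sum_i \frac{\partial f(x;\theta_0)}{\partial \theta_i}\,\Delta\theta_i
```

The left-hand side is Θ(1) whenever training actually moves the output at x, so the particular combination of parameter changes weighted by the gradient at x has to be Θ(1) as well, even if each individual Δθ_i is small.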
And the tangent kernel not changing does not imply that transfer learning won’t work
So I think I was a bit careless in saying that the NTK can’t do transfer learning at all; a more exact statement might be “the NTK does the minimal amount of transfer learning possible”. What I mean by this is that any learning algorithm can do transfer learning if the task we are ‘transferring’ to is sufficiently similar to the original task, for instance if it’s just the exact same task but with a different data sample. I claim that the ‘transfer learning’ the NTK does is of this sort. As you say, since the tangent kernel doesn’t change at all, the net effect is to move where the network starts in the tangent space. Disregarding convergence speed, the impact this has on generalization is determined by the values set by the old function on axes of the NTK outside of the span of the partial derivatives at the new function’s data points. This means that, for the NTK to transfer anything from one task to another, it’s not enough for the tasks to both feature, for instance, eyes: the eyes have to correlate with the output in the exact same way in both tasks. Indeed, the transfer learning could actually hurt generalization. Nor is its effect invariant under simple transformations like flipping the sign of the target function (this would change beneficial transfer to harmful). By default, for functions that aren’t simple multiples of each other, I expect the linear correlation between values on different axes to be about 0, even if the functions share many meaningful features. So while the NTK can do ‘transfer learning’ in a sense, it’s about as weak as possible, and I strongly doubt that this sort of transfer is sufficient to explain transfer learning’s successes in practice (but I don’t have empirical proof).
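A toy sketch of the sign-flip point, under loud assumptions: an RBF kernel stands in for the NTK, a zero function stands in for “no pretraining”, and the “pretrained” function is simply handed to the regression as its starting point. All names here (`rbf_kernel`, `fit_from_init`, `target`) are hypothetical and only for illustration; this is not anyone’s actual experiment.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.5):
    # Stand-in kernel (an assumption): a real NTK would be derived from a network.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def fit_from_init(f0, x_train, y_train, x_test, jitter=1e-8):
    # Converged kernel regression started from the function f0:
    #   f(x) = f0(x) + K(x, X) K(X, X)^{-1} (y - f0(X))
    k_train = rbf_kernel(x_train, x_train) + jitter * np.eye(len(x_train))
    k_test = rbf_kernel(x_test, x_train)
    residual = y_train - f0(x_train)
    return f0(x_test) + k_test @ np.linalg.solve(k_train, residual)

def target(x):
    # The "new task" we transfer to (arbitrary toy choice).
    return np.sin(3 * x)

rng = np.random.default_rng(0)
x_train = rng.uniform(-2, 2, size=5)   # a small sample from the new task
x_test = np.linspace(-2, 2, 200)

# Three starting functions: no pretraining, pretraining on the same task,
# and pretraining on the sign-flipped task.
inits = {
    "zero": lambda x: np.zeros_like(x),
    "same_task": target,
    "sign_flipped": lambda x: -target(x),
}

errors = {}
for name, f0 in inits.items():
    pred = fit_from_init(f0, x_train, target(x_train), x_test)
    errors[name] = float(np.mean((pred - target(x_test)) ** 2))

for name, err in errors.items():
    print(f"{name:>12}: test MSE = {err:.4g}")
```

Because the fit is linear in the starting function, starting from the sign-flipped function leaves exactly twice the residual of starting from zero, so its test MSE is four times as large, while starting from the same task is trivially perfect. That is the sense in which the transfer here depends on the old function lining up with the new target, rather than on any shared features.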
I do think the empirical results pretty strongly suggest that the NTK/GP model captures everything important about neural nets, at least in terms of their performance on the original task.
It’s true that NTK/GP perform pretty closely to finite nets on the tasks we’ve tried them on so far, but those tasks are pretty simple and we already had decent non-NN solutions. Generally the pattern is “GP matches NNs on really simple tasks, NTK on somewhat harder ones”. I think the data we have is consistent with this breaking down as we move to the harder problems that have no good non-NN solutions. I would be very interested in seeing an experiment with NTK on, say, ImageNet for this reason, but as far as I know no one has done so because of the prohibitive computational cost.
I only found one directly-relevant study, which is on way too small and simple a system for me to draw much of a conclusion from it, but it does seem to have worked.
Thanks for the link—will read this tomorrow.
BTW, thanks for humoring me throughout this thread. This is really useful, and my understanding is updating considerably.
And thank you for engaging in detail. I have also found this very helpful in forcing me to clarify (partially to myself) what my actual beliefs are.