So I read through the Maddox et al. study, and it definitely does not show that the NTK can do transfer learning. They pre-train using SGD on a single task, then use the NTK computed on the trained network to do Bayesian inference on some other tasks. They say in a footnote on page 9, “Note that in theory, there is no need to train the network at all. We found that it is practically useful to train the network to learn good representations.” This makes me suspect that they tried using the NTK to learn the transfer parameters but it didn’t work.
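To spell out what that recipe amounts to, here is a minimal sketch in JAX (my own reconstruction, not their code; `apply_fn` stands in for a pretrained scalar-output network and `params_star` for its SGD-trained weights). The idea is to linearize the net around the pretrained weights and do GP-style regression on the new task with the empirical NTK as the kernel:

```python
import jax
import jax.numpy as jnp

def empirical_ntk(apply_fn, params, x1, x2):
    """Empirical NTK Gram matrix: K[i, j] = <df(x1_i)/dtheta, df(x2_j)/dtheta>."""
    def flat_jac(x):
        # Jacobian of the scalar output w.r.t. every parameter, flattened into one vector.
        jac = jax.jacobian(lambda p: apply_fn(p, x))(params)
        return jnp.concatenate([j.ravel() for j in jax.tree_util.tree_leaves(jac)])
    J1 = jax.vmap(flat_jac)(x1)   # (n1, P)
    J2 = jax.vmap(flat_jac)(x2)   # (n2, P)
    return J1 @ J2.T              # (n1, n2)

def ntk_transfer_predict(apply_fn, params_star, x_tr, y_tr, x_te, noise=1e-3):
    """Posterior-mean prediction on a new task under the model linearized at params_star."""
    K_tt = empirical_ntk(apply_fn, params_star, x_tr, x_tr)
    K_et = empirical_ntk(apply_fn, params_star, x_te, x_tr)
    f_tr = jax.vmap(lambda x: apply_fn(params_star, x))(x_tr)
    f_te = jax.vmap(lambda x: apply_fn(params_star, x))(x_te)
    # Kernel regression on the residuals of the pretrained net's own predictions.
    alpha = jnp.linalg.solve(K_tt + noise * jnp.eye(x_tr.shape[0]), y_tr - f_tr)
    return f_te + K_et @ alpha
```

The footnote's point is that you could run this with `params_star` at random initialization; the fact that they found SGD pre-training "practically useful" is exactly the suspicious part.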
Regarding the empirical results about the NTK explaining the performance of neural nets, I found this study interesting. They computed the ‘empirical NTK’ on some finite-width networks and compared the performance of the solution found by SGD to that of the kernel-regression solution given by the NTK. For standard widths, the NTK solution performed substantially worse (up to a 20% drop in accuracy). The gap closed to some extent, but not completely, upon making the network much wider. The size of the gap also correlated with the complexity of the task (0.5% gap for MNIST, 13% for CIFAR, 18% for a subset of ImageNet). The trajectory of the weights also diverged substantially from the NTK prediction, even on MNIST. All of this seems consistent with the NTK being a decent first-order approximation that breaks down on the really hard tasks that require the networks to do non-trivial feature learning.
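As a rough illustration of the kind of diagnostic this points at (my own sketch, reusing `empirical_ntk` from above, not the study's code): measure how much the empirical kernel moves between initialization and the end of training. In the infinite-width NTK picture the kernel is frozen, so a large drift is a sign the net actually learned features.

```python
def ntk_drift(apply_fn, params_init, params_final, x_probe):
    """Relative change of the empirical NTK over training; ~0 in the NTK regime."""
    K0 = empirical_ntk(apply_fn, params_init, x_probe, x_probe)
    KT = empirical_ntk(apply_fn, params_final, x_probe, x_probe)
    return jnp.linalg.norm(KT - K0) / jnp.linalg.norm(K0)
```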
Ah, that is interesting. This definitely updates me moderately toward the “NTKs don’t learn features” hypothesis.
BTW, does this hypothesis also mean that feature learning should break down in ordinary nets as they scale up? Or does increasing the data alongside the parameter count counteract that?
I think nets are usually increased in depth as well as width when they are ‘scaled up’, so the NTK limit doesn’t apply: convergence to the NTK is controlled by the ratio of depth to width, and the kernel only becomes deterministic as that ratio approaches 0.
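If you want to see that claim concretely, here is the sort of toy experiment I have in mind (again just a sketch: `init_mlp` is a made-up helper, and it reuses `empirical_ntk` and the imports from the earlier snippet). The across-initialization spread of the empirical NTK should stay large when depth is comparable to width, and shrink only as depth/width goes to 0:

```python
def init_mlp(key, depth, width, d_in=16):
    """Plain tanh MLP with NTK-style 1/sqrt(fan_in) scaling and a scalar output."""
    sizes = [d_in] + [width] * depth + [1]
    keys = jax.random.split(key, len(sizes) - 1)
    params = [jax.random.normal(k, (m, n))
              for k, (m, n) in zip(keys, zip(sizes[:-1], sizes[1:]))]

    def apply_fn(params, x):
        h = x
        for W in params[:-1]:
            h = jnp.tanh(h @ W / jnp.sqrt(W.shape[0]))
        return (h @ params[-1] / jnp.sqrt(params[-1].shape[0]))[0]

    return params, apply_fn

def kernel_spread(depth, width, x_probe, n_seeds=8):
    """Relative std of the empirical NTK at init across random seeds."""
    kernels = []
    for s in range(n_seeds):
        params, apply_fn = init_mlp(jax.random.PRNGKey(s), depth, width)
        kernels.append(empirical_ntk(apply_fn, params, x_probe, x_probe))
    K = jnp.stack(kernels)
    return (jnp.std(K, axis=0) / (jnp.abs(jnp.mean(K, axis=0)) + 1e-8)).mean()

# e.g. compare kernel_spread(4, 64, x) with kernel_spread(4, 4096, x):
# the spread should only vanish as depth/width -> 0.
```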