Another problem with NTK/GP theory is that it cannot capture feature learning or transfer learning, and it generally starts to break down as models and tasks become more complex. In short, NTK/GP fails to capture some important empirical realities of how modern networks are trained.
From the post “NTK/GP Models of Neural Nets Can’t Learn Features”:
Since people are talking about the NTK/GP hypothesis of neural nets again, I thought it might be worth bringing up some recent research in the area that casts doubt on their explanatory power. The upshot is: NTK/GP models of neural networks can’t learn features. By ‘feature learning’ I mean the process where intermediate neurons come to represent task-relevant features such as curves, elements of grammar, or cats. Closely related to feature learning is transfer learning, the typical practice whereby a neural net is trained on one task, then ‘fine-tuned’ with a lower learning rate to fit another task, usually with less data than the first. This is often a powerful way to approach learning in the low-data regime, but NTK/GP models can’t do it at all.
The reason for this is pretty simple. During training on the ‘old task’, the network stays in the ‘tangent space’ of its initialization. This means that, to first order, none of the functions/derivatives computed by the individual neurons change at all; only the output function does.[1] Feature learning requires the intermediate neurons to adapt to structures in the data that are relevant to the task being learned, but in the NTK limit the intermediate neurons’ functions don’t change at all. Any meaningful function like a ‘car detector’ would need to be there at initialization—extremely unlikely for functions of any complexity. This lack of feature learning implies a lack of meaningful transfer learning as well: since the NTK is just doing linear regression using an (infinite) fixed set of functions, the only ‘transfer’ that can occur is shifting where the regression starts in this space. This could potentially speed up convergence, but it wouldn’t provide any benefits in terms of representation efficiency for tasks with few data points[2]. This property holds for the GP limit as well—the distribution of functions computed by intermediate neurons doesn’t change after conditioning on the outputs, so networks sampled from the GP posterior wouldn’t be useful for transfer learning either.
This also makes me skeptical of the Mingard et al. result about SGD being equivalent to picking a random neural net with given performance, given that picking a random net is equivalent to running a GP regression in the infinite-width limit. In particular, it makes me skeptical that this result will generalize to the complex models and tasks we care about. ‘GP/NTK performs similarly to SGD on simple tasks’ has been found before, but it tends to break down as the tasks become more complex.[3]
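The ‘frozen features’ claim can be illustrated numerically. Below is a minimal NumPy sketch (not from the quoted post) of the lazy-training effect the argument relies on: a two-layer tanh network is trained under NTK parameterization, and the relative change of its hidden-layer activations is measured as the width grows. The toy target sin(3x), the widths, the hyperparameters, and the function name `feature_drift` are illustrative assumptions.

```python
import numpy as np

def feature_drift(width, steps=1000, lr=1.0, n_train=20, seed=0):
    """Train a two-layer tanh net under NTK parameterization and return the
    relative change of its hidden-layer activations on the training inputs."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(n_train, 1))
    y = np.sin(3.0 * x)                       # illustrative toy regression target

    # NTK parameterization: O(1) weights, explicit 1/sqrt(width) output scaling.
    W1 = rng.normal(size=(1, width))
    b1 = rng.normal(size=(width,))
    W2 = rng.normal(size=(width, 1))
    scale = 1.0 / np.sqrt(width)

    def hidden():
        return np.tanh(x @ W1 + b1)           # the per-neuron functions

    h_init = hidden()                         # features at initialization
    for _ in range(steps):
        h = hidden()
        out = h @ W2 * scale
        grad_out = (out - y) / n_train        # gradient of 0.5 * mean squared error
        # Manual backprop through the two layers.
        dW2 = h.T @ grad_out * scale
        dpre = (grad_out @ W2.T * scale) * (1.0 - h ** 2)
        dW1 = x.T @ dpre
        db1 = dpre.sum(axis=0)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2

    return np.linalg.norm(hidden() - h_init) / np.linalg.norm(h_init)

# The relative feature drift should shrink as the width grows (roughly like
# 1/sqrt(width)); in the infinite-width NTK limit the intermediate features
# effectively never change.
for w in (64, 256, 1024, 4096):
    print(f"width={w:5d}  relative feature drift={feature_drift(w):.4f}")
```

The explicit 1/sqrt(width) output scaling is what puts the net in the lazy/NTK regime; the shrinking drift printed at the end is the finite-width shadow of the claim that, in the limit, the per-neuron functions are frozen at initialization.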
In essence, NTK/GP models can’t transfer learn because the network never leaves the tangent space of its initialization: in the NTK limit the intermediate features are frozen, so the only ‘transfer’ possible is a shifted starting point for the same fixed-feature regression.
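To make that last point concrete, here is a second speculative NumPy sketch (again not from the post) of the linearized/NTK picture of transfer: the model is treated as linear regression on the frozen Jacobian features (the gradient of the network output with respect to its parameters at initialization), it is ‘pretrained’ on a task A, and then fit to a small task B either from scratch or warm-started from the task-A solution. The tasks (sin(3x) and a shifted sin(3x + 0.2)), the width, the tolerances, and the helper names `ntk_features` and `gd_steps_to_fit` are all assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512                                       # hidden width (illustrative)
W1 = rng.normal(size=(1, n))
b1 = rng.normal(size=(n,))
W2 = rng.normal(size=(n, 1))
scale = 1.0 / np.sqrt(n)

def f0(x):
    """Network output at initialization."""
    return np.tanh(x @ W1 + b1) @ W2 * scale

def ntk_features(x):
    """Frozen Jacobian features: gradient of the output w.r.t. (W1, b1, W2)
    at initialization. These never change when training the linearized model."""
    h = np.tanh(x @ W1 + b1)                  # (m, n)
    g = (1.0 - h ** 2) * (W2.T * scale)       # (m, n)
    return np.concatenate([g * x, g, h * scale], axis=1)   # (m, 3n)

# Task A ("pretraining"): many points of sin(3x).
xA = np.linspace(-1, 1, 30).reshape(-1, 1); yA = np.sin(3 * xA)
# Task B ("fine-tuning"): few points of a slightly shifted target.
xB = np.linspace(-1, 1, 5).reshape(-1, 1);  yB = np.sin(3 * xB + 0.2)

phiB = ntk_features(xB)                       # the same fixed features either way

# 'Pretrain' the linearized model on task A (minimum-norm least squares).
dthetaA, *_ = np.linalg.lstsq(ntk_features(xA), yA - f0(xA), rcond=None)

def gd_steps_to_fit(start, lr=0.5, tol=1e-4, max_steps=50_000):
    """Gradient-descend the linearized model on task B; count steps to fit."""
    d = start.copy()
    for t in range(max_steps):
        resid = f0(xB) + phiB @ d - yB
        if np.mean(resid ** 2) < tol:
            return t
        d -= lr * phiB.T @ resid / len(xB)
    return max_steps

# The warm start can converge in fewer steps, but the hypothesis class --
# linear functions of the frozen features phiB -- is identical in both runs.
print("from scratch :", gd_steps_to_fit(np.zeros_like(dthetaA)), "steps")
print("from task A  :", gd_steps_to_fit(dthetaA), "steps")
```

The point of the comparison is not the exact step counts (those depend on the made-up tasks) but that `phiB` is computed once from the initialization and is identical with or without pretraining: in the linearized picture, pretraining can only move the starting point of the same fixed-feature regression, which is the post’s ‘no representational benefit’ claim.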
A link to the post is below:
https://www.lesswrong.com/posts/76cReK4Mix3zKCWNT/ntk-gp-models-of-neural-nets-can-t-learn-features