Some counter evidence:
Kernelized Concept Erasure: concept encodings do have nonlinear components. Nonlinear kernels can erase certain parts of those encodings, but they cannot prevent other types of nonlinear kernels from extracting concept information from other parts of the embedding space (a toy illustration of this point follows the list below).
Limitations of the NTK for Understanding Generalization in Deep Learning: the neural tangent kernels of realistic neural networks change continuously throughout training. Further, neither the initial kernels nor any of the empirical kernels from mid-training can reproduce the asymptotic scaling laws of the actual neural network, which are better than those kernels predict (a second toy check of this kernel drift also follows the list).
Mechanistic Mode Connectivity: LMs often have non-connected solution basins, which correspond to different underlying mechanisms by which they make their classification decisions.
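As a toy illustration of the first point (this is only an illustrative sketch, not the paper's method, which performs kernelized adversarial erasure; the data, feature map, and probes below are arbitrary choices): build an explicit approximate RBF feature map, do INLP-style linear erasure in that feature space, and then check whether a probe using a different nonlinear kernel can still recover the concept.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy "embeddings" in which a binary concept is encoded nonlinearly:
# the concept is whether the point lies outside the median-radius circle.
X = rng.normal(size=(2000, 2))
r = np.linalg.norm(X, axis=1)
y = (r > np.median(r)).astype(int)

# Explicit (approximate) RBF feature map, so kernelized erasure becomes
# ordinary linear erasure in this feature space.
Z = Nystroem(kernel="rbf", gamma=1.0, n_components=100, random_state=0).fit_transform(X)

# INLP-style erasure: repeatedly fit a linear probe in the RBF feature
# space and project out its weight direction.
for _ in range(40):
    w = LogisticRegression(max_iter=1000).fit(Z, y).coef_[0]
    w = w / np.linalg.norm(w)
    Z = Z - np.outer(Z @ w, w)

# A probe matching the erasure kernel should now be close to chance,
# while a probe with a *different* nonlinear kernel can typically still
# recover the concept from what remains.
lin_acc = LogisticRegression(max_iter=1000).fit(Z, y).score(Z, y)
poly_acc = SVC(kernel="poly", degree=3, coef0=1.0).fit(Z, y).score(Z, y)
print(f"linear probe on erased RBF features: {lin_acc:.2f}")
print(f"poly-kernel probe on erased RBF features: {poly_acc:.2f}")
```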
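And a quick sanity check of the second point, that the empirical NTK of a realistic (finite) network drifts during training. Again this is just a toy sketch with arbitrary choices: it computes the Gram matrix of per-example parameter gradients before training and at a few checkpoints, and reports how far the kernel has moved.

```python
import torch

torch.manual_seed(0)

# Tiny regression task and a small finite-width MLP (so nothing forces it
# to stay in the lazy/NTK regime).
X = torch.randn(20, 4)
y = torch.sin(X.sum(dim=1, keepdim=True))
net = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))

def empirical_ntk(model, xs):
    # Gram matrix of per-example parameter gradients:
    # K[i, j] = <d f(x_i)/d theta, d f(x_j)/d theta>.
    rows = []
    for x in xs:
        out = model(x.unsqueeze(0)).squeeze()
        grads = torch.autograd.grad(out, list(model.parameters()))
        rows.append(torch.cat([g.reshape(-1) for g in grads]).detach())
    G = torch.stack(rows)
    return G @ G.T

K0 = empirical_ntk(net, X)
opt = torch.optim.SGD(net.parameters(), lr=0.05)
for step in range(1, 2001):
    opt.zero_grad()
    torch.nn.functional.mse_loss(net(X), y).backward()
    opt.step()
    if step % 500 == 0:
        drift = (empirical_ntk(net, X) - K0).norm() / K0.norm()
        print(f"step {step}: relative change of the empirical NTK from init = {drift:.3f}")
```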
Thanks for these links! This is exactly what I was looking for, as per Cunningham's law. On mechanistic mode connectivity: I still need to read the paper, but there is definitely a more complex story here, where symmetries render things non-connected by default, yet once you account for the symmetries and project into an isometric space where they are all collapsed, things become connected and linear again. Is this different from that?
I agree about the NTK. I think this explanation is bad in its specifics, although the NTK does give useful explanations at a very coarse level of granularity. In general, to put a completely uncalibrated number on it, I feel like NNs are probably '90% linear' in their feature representations. Of course they have to have somewhat nonlinear representations as well. But on the other hand, if we could get 90% of the way to features, that would be massive progress and might be relatively easy.
One other problem with NTK/GP theory is that it isn't able to capture feature learning or transfer learning, and it generally starts to break down as models get more complicated. In essence, NTK/GP fails to capture some empirical realities.
From the post “NTK/GP Models of Neural Nets Can’t Learn Features”:
Since people are talking about the NTK/GP hypothesis of neural nets again, I thought it might be worth bringing up some recent research in the area that casts doubt on their explanatory power. The upshot is: NTK/GP models of neural networks can’t learn features. By ‘feature learning’ I mean the process where intermediate neurons come to represent task-relevant features such as curves, elements of grammar, or cats. Closely related to feature learning is transfer learning, the typical practice whereby a neural net is trained on one task, then ‘fine-tuned’ with a lower learning rate to fit another task, usually with less data than the first. This is often a powerful way to approach learning in the low-data regime, but NTK/GP models can’t do it at all.
The reason for this is pretty simple. During training on the ‘old task’, NTK stays in the ‘tangent space’ of the network’s initialization. This means that, to first order, none of the functions/derivatives computed by the individual neurons change at all; only the output function does.[1] Feature learning requires the intermediate neurons to adapt to structures in the data that are relevant to the task being learned, but in the NTK limit the intermediate neurons’ functions don’t change at all. Any meaningful function like a ‘car detector’ would need to be there at initialization—extremely unlikely for functions of any complexity. This lack of feature learning implies a lack of meaningful transfer learning as well: since the NTK is just doing linear regression using an (infinite) fixed set of functions, the only ‘transfer’ that can occur is shifting where the regression starts in this space. This could potentially speed up convergence, but it wouldn’t provide any benefits in terms of representation efficiency for tasks with few data points[2]. This property holds for the GP limit as well—the distribution of functions computed by intermediate neurons doesn’t change after conditioning on the outputs, so networks sampled from the GP posterior wouldn’t be useful for transfer learning either.
This also makes me skeptical of the Mingard et al. result about SGD being equivalent to picking a random neural net with given performance, given that picking a random net is equivalent to running a GP regression in the wide-width limit. In particular, it makes me skeptical that this result will generalize to the complex models and tasks we care about. ‘GP/NTK performs similarly to SGD on simple tasks’ has been found before, but it tends to break down as the tasks become more complex.[3]
In essence, NTK/GP models can't transfer learn because they stay in the tangent space of their initialization, so the functions computed by their intermediate neurons never change, and this holds exactly in the infinite-width NTK limit. (A small sketch illustrating this frozen-features point follows the link below.)
A link to the post is below:
https://www.lesswrong.com/posts/76cReK4Mix3zKCWNT/ntk-gp-models-of-neural-nets-can-t-learn-features
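To make the quoted 'no feature learning to first order' argument concrete, here is a rough sketch (my own construction, not from the post; the toy task, architecture, and hyperparameters are arbitrary). It fits a small MLP two ways: via kernel ridge regression with its empirical NTK at initialization, which is what linearized/NTK-style training amounts to and which by construction leaves the hidden-layer features at their initial values, and via ordinary gradient training, after which we can measure how far those hidden features actually moved.

```python
import torch

torch.manual_seed(0)

# Toy task: the target depends on a nonlinear interaction of two inputs.
X = torch.randn(200, 8)
y = torch.tanh(X[:, 0:1] * X[:, 1:2])

net = torch.nn.Sequential(torch.nn.Linear(8, 128), torch.nn.Tanh(), torch.nn.Linear(128, 1))

def hidden(model, x):
    # Post-activation features of the hidden layer (Linear -> Tanh).
    return model[1](model[0](x))

def param_grad_features(model, xs):
    # Per-example gradient of the scalar output w.r.t. all parameters.
    # Linearized (NTK-style) training is just linear regression on these
    # features, which are frozen at initialization.
    rows = []
    for x in xs:
        out = model(x.unsqueeze(0)).squeeze()
        grads = torch.autograd.grad(out, list(model.parameters()))
        rows.append(torch.cat([g.reshape(-1) for g in grads]).detach())
    return torch.stack(rows)

H0 = hidden(net, X).detach()          # hidden features at initialization
f0 = net(X).detach()                  # outputs at initialization
J0 = param_grad_features(net, X)
K0 = J0 @ J0.T                        # empirical NTK at initialization

# (a) "NTK training": kernel ridge regression with the frozen kernel K0.
alpha = torch.linalg.solve(K0 + 1e-3 * torch.eye(len(X)), y - f0)
ntk_preds = f0 + K0 @ alpha

# (b) Ordinary training of the same network.
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(3000):
    opt.zero_grad()
    torch.nn.functional.mse_loss(net(X), y).backward()
    opt.step()

H1 = hidden(net, X).detach()
mse = torch.nn.functional.mse_loss
print("NTK-regression train MSE:   ", mse(ntk_preds, y).item())
print("ordinary-training train MSE:", mse(net(X), y).item())
# The linearized model's hidden features are H0 by definition; the real
# net's features move, which is the feature learning the NTK limit omits.
print("relative change in hidden features under ordinary training:",
      ((H1 - H0).norm() / H0.norm()).item())
```

The contrast is the post's point: the frozen-kernel model can still fit the training targets, but nothing about its intermediate representations adapts to the task, so there is nothing useful to transfer.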