I don’t understand why we want a theoretical explanation of neural network generalization to have the same “parts we can examine” as a neural network. If we could describe a prior such that Bayesian updating on that prior gave the same generalization behavior as neural networks, then this would not “explain feature learning”, right? But it would still be a perfectly useful theoretical account of NN generalization.
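To make concrete what I mean by “Bayesian updating on a prior”: if the prior is, say, a Gaussian process over functions, conditioning on the training data gives a closed-form posterior, and the generalization behavior is entirely a property of the kernel; no “neurons” appear anywhere in the account. Here is a minimal sketch, with an RBF kernel standing in for whatever the network’s implicit prior actually is (the kernel choice and the toy data are purely illustrative):

```python
import jax
import jax.numpy as jnp

def rbf_kernel(x1, x2, lengthscale=1.0):
    # Squared-exponential kernel: purely a stand-in for whatever the
    # network's implicit prior over functions actually is.
    sq_dists = jnp.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior_mean(x_train, y_train, x_test, noise=1e-3):
    # "Bayesian updating on the prior": condition the GP prior on the
    # training data and read off the posterior predictive mean at x_test.
    k_tt = rbf_kernel(x_train, x_train)
    k_st = rbf_kernel(x_test, x_train)
    alpha = jnp.linalg.solve(k_tt + noise * jnp.eye(len(x_train)), y_train)
    return k_st @ alpha

# Toy usage: how the posterior mean fills in the gaps between training
# points (i.e. how it generalizes) is determined entirely by the kernel.
key = jax.random.PRNGKey(0)
x_train = jax.random.uniform(key, (20, 1), minval=-3.0, maxval=3.0)
y_train = jnp.sin(2 * x_train[:, 0])
x_test = jnp.linspace(-3.0, 3.0, 100)[:, None]
print(gp_posterior_mean(x_train, y_train, x_test)[:5])
```

A theoretical account of this shape would tell us what the effective kernel/prior is, not what any intermediate neuron is doing.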
I agree that the evidence suggests finite-width neural networks generalize a little better than infinite-width NTK regression. But attributing this to feature learning doesn’t follow. Couldn’t neural networks just have a slightly better implicit prior, for which the NTK is just an approximation?
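For what it’s worth, the comparison I have in mind is something like the following: take a small finite-width network, train it with gradient descent, and compare it against kernel regression with that same network’s empirical NTK at initialization (the linearized account of its training). A rough JAX sketch on a toy 1-D regression problem; the architecture, initialization scale, and hyperparameters are arbitrary illustrative choices, and the empirical finite-width NTK is only a stand-in for the infinite-width kernel:

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def init_mlp(key, d_in=1, width=256):
    k1, k2, k3 = jax.random.split(key, 3)
    return {"w1": jax.random.normal(k1, (d_in, width)) / jnp.sqrt(d_in),
            "b1": jax.random.normal(k2, (width,)),
            "w2": jax.random.normal(k3, (width, 1)) / jnp.sqrt(width)}

def f(params, x):
    # Two-layer ReLU network with a scalar output.
    return (jax.nn.relu(x @ params["w1"] + params["b1"]) @ params["w2"])[:, 0]

def empirical_ntk(params, x1, x2):
    # Theta(x1, x2) = J(x1) J(x2)^T, with J the Jacobian of the network
    # output with respect to all parameters (flattened into one vector).
    flat, unravel = ravel_pytree(params)
    grad_single = lambda x: jax.grad(lambda p: f(unravel(p), x[None])[0])(flat)
    return jax.vmap(grad_single)(x1) @ jax.vmap(grad_single)(x2).T

key = jax.random.PRNGKey(0)
params0 = init_mlp(key)
x_train = jnp.linspace(-3.0, 3.0, 30)[:, None]
y_train = jnp.sin(2 * x_train[:, 0])
x_test = jnp.linspace(-3.0, 3.0, 100)[:, None]
y_test = jnp.sin(2 * x_test[:, 0])

# (a) Kernel regression with the empirical NTK at initialization: the
# linearized ("lazy") account of training this network to fit the data.
k_tt = empirical_ntk(params0, x_train, x_train)
k_st = empirical_ntk(params0, x_test, x_train)
resid = y_train - f(params0, x_train)
ntk_pred = f(params0, x_test) + k_st @ jnp.linalg.solve(
    k_tt + 1e-4 * jnp.eye(len(x_train)), resid)

# (b) The same finite-width network actually trained by gradient descent.
loss = lambda p: jnp.mean((f(p, x_train) - y_train) ** 2)

@jax.jit
def step(p, lr=1e-2):
    return jax.tree_util.tree_map(lambda a, g: a - lr * g, p, jax.grad(loss)(p))

params = params0
for _ in range(20_000):
    params = step(params)

print("NTK regression test MSE:  ", float(jnp.mean((ntk_pred - y_test) ** 2)))
print("trained network test MSE: ", float(jnp.mean((f(params, x_test) - y_test) ** 2)))
```

Which of the two does better on a toy like this depends on the details; the point is just that both sides of the comparison are well-defined objects you can compute, and a small gap between them is compatible with “the true implicit prior is slightly better than the NTK.”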
If we could describe a prior such that Bayesian updating on that prior gave the same generalization behavior
Sure, I just think that any such prior is likely to explicitly or implicitly explain feature learning, since feature learning is part of what makes SGD-trained networks work.
But attributing this to feature learning doesn’t follow
I think it’s likely that dog-nose-detecting neurons play a part in classifying dogs, curve-detecting neurons play a part in classifying objects more generally, etc. This is all that is meant by ‘feature learning’: intermediate neurons changing in some functionally useful way. And feature learning is required for pre-training on a related task to be helpful, so it would be a weird coincidence if it were useless when training on a single task.
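To pin down what that claim means operationally, here is the kind of toy comparison I have in mind: train a small network on one task, then fit only a fresh linear readout on top of its frozen hidden activations for a related task, and compare against doing the same with the hidden activations at random initialization. Everything below (architecture, tasks, hyperparameters) is an arbitrary illustrative choice, and I’m not claiming the gap is large on a 1-D toy; the sketch is just to make “intermediate neurons changing in a functionally useful way” measurable:

```python
import jax
import jax.numpy as jnp

def init_mlp(key, d_in=1, width=16):
    k1, k2, k3 = jax.random.split(key, 3)
    return {"w1": jax.random.normal(k1, (d_in, width)) / jnp.sqrt(d_in),
            "b1": jax.random.normal(k2, (width,)),
            "w2": jax.random.normal(k3, (width, 1)) / jnp.sqrt(width)}

def features(params, x):
    # Hidden-layer activations: the candidate "learned features".
    return jax.nn.relu(x @ params["w1"] + params["b1"])

def forward(params, x):
    return (features(params, x) @ params["w2"])[:, 0]

def train(params, x, y, steps=10_000, lr=1e-2):
    loss = lambda p: jnp.mean((forward(p, x) - y) ** 2)
    step = jax.jit(lambda p: jax.tree_util.tree_map(
        lambda a, g: a - lr * g, p, jax.grad(loss)(p)))
    for _ in range(steps):
        params = step(params)
    return params

def readout_test_mse(feats_tr, y_tr, feats_te, y_te, ridge=1e-3):
    # Fit only a fresh linear output layer on top of frozen features.
    w = jnp.linalg.solve(feats_tr.T @ feats_tr + ridge * jnp.eye(feats_tr.shape[1]),
                         feats_tr.T @ y_tr)
    return float(jnp.mean((feats_te @ w - y_te) ** 2))

key = jax.random.PRNGKey(0)
x_train = jnp.linspace(-3.0, 3.0, 40)[:, None]
x_test = jnp.linspace(-3.0, 3.0, 200)[:, None]
task_a = lambda x: jnp.sin(2 * x[:, 0])          # "pre-training" task
task_b = lambda x: jnp.sin(2 * x[:, 0] + 0.7)    # related downstream task

params0 = init_mlp(key)
params_a = train(params0, x_train, task_a(x_train))

for name, p in [("features after training on task A", params_a),
                ("features at random initialization", params0)]:
    mse = readout_test_mse(features(p, x_train), task_b(x_train),
                           features(p, x_test), task_b(x_test))
    print(name, "-> task B test MSE:", mse)
```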
There are also plenty of examples from interpretability work of intermediate neurons changing in clearly functionally useful ways. I haven’t read it in detail, but this article analyzes how a particular family of neurons comes together to implement a curve-detecting algorithm, and it’s clear that the intermediate neurons have to change substantially for that circuit to work.