Feature learning requires the intermediate neurons to adapt to structures in the data that are relevant to the task being learned, but in the NTK limit the intermediate neurons’ functions don’t change at all. Any meaningful function like a ‘car detector’ would need to be there at initialization—extremely unlikely for functions of any complexity.
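For concreteness, here's a minimal sketch of the frozen-features point, using a random-features caricature of the lazy/NTK-style regime rather than an actual NTK computation (the task, sizes, and setup below are made up for illustration): the hidden ReLU features are fixed at their random initialization and only a linear readout gets fit on top of them.

```python
# Random-features caricature of the lazy/NTK-style regime (not a real NTK
# computation): hidden features are frozen at random init; only the linear
# readout is trained. Task and sizes are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, d, width = 500, 10, 2048

X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] * X[:, 1])                 # arbitrary target the features were never tuned for

W0 = rng.normal(size=(d, width)) / np.sqrt(d)  # random hidden weights, never updated
phi = np.maximum(X @ W0, 0.0)                  # fixed random ReLU features

# "Training" only adjusts the readout; the neuron functions phi are identical
# before and after (here just a least-squares fit for brevity).
readout, *_ = np.linalg.lstsq(phi, y, rcond=None)
train_acc = (np.sign(phi @ readout) == y).mean()
print(f"train accuracy with frozen random features: {train_acc:.2f}")
```

In this caricature, anything the trained model computes is a linear combination of features that were already present at initialization, which is the sense in which something like a 'car detector' would have to be there (up to a linear readout) from the start.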
I used to think it would be extremely unlikely for a randomly initialized neural net to contain a subnetwork that performs just as well as the entire neural net does after training. But the multi-prize lottery ticket results seem to show just that. So now I don’t know what to think about what sorts of things are likely or unlikely when it comes to this stuff. In particular, is it really so unlikely that ‘car detector’ functions exist somewhere in the random jumble of a sufficiently big randomly initialized NN? Or maybe they don’t exist right away, but with very slight tweaks they do?
They would exist in a sufficiently big random NN, but their density would be extremely low, I think. Like, if you train a normal neural net with 15000 neurons and it ends up with a car detector, the density of car detectors is now 1/15000. Whereas I think the density at initialization is probably more like 1/2^50 or something like that (numbers completely made up), so they’d have a negligible effect on the NTK’s learning ability (‘slight tweaks’ can’t happen in the NTK regime, since no intermediate functions change by definition).
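For what it's worth, here's the same back-of-the-envelope arithmetic spelled out, reusing the completely made-up densities from above (1/15000 after training vs 1/2^50 at initialization); none of these numbers are empirical.

```python
# Back-of-the-envelope version of the density argument, reusing the
# made-up numbers from the comment above; purely illustrative.
n_neurons = 15_000
density_after_training = 1 / n_neurons   # roughly one car detector per trained net
density_at_init = 2.0 ** -50             # hypothetical density in a random net

expected_at_init = n_neurons * density_at_init
print(f"expected car detectors in one random net: {expected_at_init:.1e}")        # ~1.3e-11
print(f"random nets needed to expect one by chance: {1 / expected_at_init:.1e}")  # ~7.5e10
```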
A difference with the pruning case is that the number of possible prunings grows exponentially with the number of neurons, whereas the number of individual neuron functions only grows linearly with network size. My take on the LTH is that pruning is basically just a weird way of doing optimization so it’s not that surprising you can get good performance.
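A toy count of that asymmetry (neuron-level pruning only; weight-level pruning, which is what the LTH papers typically use, gives even more subnetworks):

```python
# Subnetworks obtainable by keeping/dropping each neuron grow exponentially,
# while the pool of individual neuron functions grows only linearly.
for n in (10, 100, 1000):
    prunings = 2 ** n        # each neuron independently kept or dropped
    print(f"{n:4d} neurons -> {n} single-neuron functions, ~{prunings:.1e} prunings")
```

So even if any particular meaningful function is individually very rare, the exponentially large space of prunings can still contain one, while the linearly sized pool of individual neuron functions very probably does not.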
My take on the LTH is that pruning is basically just a weird way of doing optimization so it’s not that surprising you can get good performance.
+1 to this in particular; I think this is the main point Daniel (and many people like Daniel) are missing here. There’s a very big difference between “car detector functions exist somewhere in the random jumble of a sufficiently big randomly initialized NN” and “the net can be pruned to yield a car detector function”, and the LTH papers show the latter.
I think I get this distinction; I realize the LTH papers show the latter; I guess our disagreement is about how big a deal / how surprising this is.