Feature learning requires the intermediate neurons to adapt to structures in the data that are relevant to the task being learned, but in the NTK limit the intermediate neurons’ functions don’t change at all. Any meaningful function like a ‘car detector’ would need to be there at initialization—extremely unlikely for functions of any complexity.
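For concreteness, here's a minimal sketch of the frozen-features point, using a random-features caricature of the lazy/NTK-style regime rather than an actual NTK computation (the task, sizes, and setup below are made up for illustration): the hidden ReLU features are fixed at their random initialization and only a linear readout gets fit on top of them.

```python
# Random-features caricature of the lazy/NTK-style regime (not a real NTK
# computation): hidden features are frozen at random init; only the linear
# readout is trained. Task and sizes are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, d, width = 500, 10, 2048

X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] * X[:, 1])                 # arbitrary target the features were never tuned for

W0 = rng.normal(size=(d, width)) / np.sqrt(d)  # random hidden weights, never updated
phi = np.maximum(X @ W0, 0.0)                  # fixed random ReLU features

# "Training" only adjusts the readout; the neuron functions phi are identical
# before and after (here just a least-squares fit for brevity).
readout, *_ = np.linalg.lstsq(phi, y, rcond=None)
train_acc = (np.sign(phi @ readout) == y).mean()
print(f"train accuracy with frozen random features: {train_acc:.2f}")
```

In this caricature, anything the trained model computes is a linear combination of features that were already present at initialization, which is the sense in which something like a 'car detector' would have to be there (up to a linear readout) from the start.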
I used to think it would be extremely unlikely for a randomly initialized neural net to contain a subnetwork that performs just as well as the entire neural net does after training. But the multi-prize lottery ticket results seem to show just that. So now I don’t know what to think about what sorts of things are likely or unlikely when it comes to this stuff. In particular, is it really so unlikely that ‘car detector’ functions exist somewhere in the random jumble of a sufficiently big randomly initialized NN? Or maybe they don’t exist right away, but with very slight tweaks they do?
They would exist in a sufficiently big random NN, but their density would be extremely low, I think. Like, if you train a normal neural net with 15000 neurons and it ends up with a car detector, the density of car detectors is now 1/15000. Whereas I think the density at initialization is probably more like 1/2^50 or something like that (numbers completely made up), so they’d have a negligible effect on the NTK’s learning ability (‘slight tweaks’ can’t happen in the NTK regime, since no intermediate functions change by definition).
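For what it's worth, here's the same back-of-the-envelope arithmetic spelled out, reusing the completely made-up densities from above (1/15000 after training vs 1/2^50 at initialization); none of these numbers are empirical.

```python
# Back-of-the-envelope version of the density argument, reusing the
# made-up numbers from the comment above; purely illustrative.
n_neurons = 15_000
density_after_training = 1 / n_neurons   # roughly one car detector per trained net
density_at_init = 2.0 ** -50             # hypothetical density in a random net

expected_at_init = n_neurons * density_at_init
print(f"expected car detectors in one random net: {expected_at_init:.1e}")        # ~1.3e-11
print(f"random nets needed to expect one by chance: {1 / expected_at_init:.1e}")  # ~7.5e10
```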
A difference with the pruning case is that the number of possible prunings grows exponentially with the number of neurons, whereas the number of individual neuron functions only grows linearly with network size. My take on the LTH is that pruning is basically just a weird way of doing optimization so it’s not that surprising you can get good performance.
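A toy count of that asymmetry (neuron-level pruning only; weight-level pruning, which is what the LTH papers typically use, gives even more subnetworks):

```python
# Subnetworks obtainable by keeping/dropping each neuron grow exponentially,
# while the pool of individual neuron functions grows only linearly.
for n in (10, 100, 1000):
    prunings = 2 ** n        # each neuron independently kept or dropped
    print(f"{n:4d} neurons -> {n} single-neuron functions, ~{prunings:.1e} prunings")
```

So even if any particular meaningful function is individually very rare, the exponentially large space of prunings can still contain one, while the linearly sized pool of individual neuron functions very probably does not.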
My take on the LTH is that pruning is basically just a weird way of doing optimization so it’s not that surprising you can get good performance.
+1 to this in particular; I think this is the main point Daniel (and many people like Daniel) are missing here. There’s a very big difference between “car detector functions exist somewhere in the random jumble of a sufficiently big randomly initialized NN” and “the net can be pruned to yield a car detector function”, and the LTH papers show the latter.
I think I get this distinction; I realize the LTH papers show the latter; I guess our disagreement is about how big a deal / how surprising this is.