Unfortunately, the strongest forms of the hypothesis do not seem plausible—e.g. I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization.
We also compare to random, untrained weights because Jarrett et al. (2009) showed — quite strikingly — that the combination of random convolutional filters, rectification, pooling, and local normalization can work almost as well as learned features. They reported this result on relatively small networks of two or three learned layers and on the smaller Caltech-101 dataset (Fei-Fei et al., 2004). It is natural to ask whether or not the nearly optimal performance of random filters they report carries over to a deeper network trained on a larger dataset.
I think there are papers showing exactly this, like “Deconstructing Lottery Tickets” and “What is the Best Multi-Stage Architecture for Object Recognition?” The passage quoted above is from another paper, describing the second one.
(My interpretation of their results is ‘yeah, actually, randomly initialized convs do pretty well on ImageNet’; I remember coming across a paper that answers that question more exactly and gets a clearer ‘yes’, but I can’t find it at the moment. I remember them freezing a conv architecture and then training only the fully connected net at the end.)
Why do you doubt this? Are you seeing a bunch of evidence that I’m not? Or are you imagining new architectures that people haven’t done these tests for yet / have done these tests and the new architectures fail?
[Maybe your standards are higher than mine—in the DLT paper, they’re able to get 65% performance on CIFAR-10 by just optimizing a binary mask on the randomly initialized parameters, which is ok but not good.]
In hindsight, I probably should have explained this more carefully. “Today’s neural networks already contain dog-recognizing subcircuits at initialization” was not an accurate summary of exactly what I think is implausible.
Here’s a more careful version of the claim:
I do not find it plausible that a random network contains a neuron which acts as a reliable dog-detector. This is the sense in which it’s not plausible that networks contain dog-recognizing subcircuits at initialization. But this is what would be necessary for the “lottery ticket” intuition (i.e. training just picks out some pre-existing useful functionality) to work.
The original lottery ticket paper (and subsequent work) found subcircuits which detect dogs (or something comparable) after disconnecting them from the rest of the network (i.e. after pruning). The “lottery ticket” story is not actually a good intuition for what’s going on there; the pruning step itself does a whole bunch of optimization work, and “it’s just upweighting a pre-existing function” is not an accurate intuition for that optimization.
(The paper on randomized filters is orthogonal to this—it sounds like they’re finding that simple features like edge detection do show up in randomly initialized nets, which is totally plausible. But they still need optimization for the deeper/higher-level parts of the net; they’re just using random initialization for the first few layers, assuming I’m understanding it correctly.)
If there were a robust dog-detection neuron at initialization, and SGD just learned to use that neuron’s output, then the “lottery ticket” intuition would be a very good description of what’s going on. But pruning is a very different operation, one which fundamentally changes what a circuit is computing. The first paper you link uses the phrase “masking is training”, which seems like a good way to think about it. A masking operation isn’t just “picking a winning lottery ticket”, it’s changing the functional behavior of the nodes.
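To make “masking is training” concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than the setup from either paper: a single linear unit with eight frozen random weights, a hypothetical target that depends only on the first two inputs, and a greedy bit-flip search standing in for whatever mask-optimization procedure the papers actually use. The point is only that the weights never change, yet the function the unit computes does.

```python
import random

def predict(weights, mask, x):
    # Masked linear unit: a pruned edge (mask bit 0) contributes nothing.
    return sum(w * m * xi for w, m, xi in zip(weights, mask, x))

def loss(weights, mask, data):
    # Mean squared error of the masked unit on (input, target) pairs.
    return sum((predict(weights, mask, x) - y) ** 2 for x, y in data) / len(data)

def train_mask(weights, data, sweeps=5):
    """Greedy bit-flip search over the mask; `weights` is never modified."""
    mask = [1] * len(weights)
    best = loss(weights, mask, data)
    for _ in range(sweeps):
        for i in range(len(mask)):
            mask[i] ^= 1                      # try flipping edge i
            trial = loss(weights, mask, data)
            if trial < best:
                best = trial                  # keep the flip
            else:
                mask[i] ^= 1                  # undo the flip
    return mask, best

rng = random.Random(0)
n = 8
weights = [rng.gauss(0.0, 1.0) for _ in range(n)]   # frozen "initialization"
frozen_copy = list(weights)

# Hypothetical target function: depends only on the first two inputs.
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(50)]
data = [(x, x[0] - x[1]) for x in inputs]

initial_loss = loss(weights, [1] * n, data)
mask, final_loss = train_mask(weights, data)
print("loss before masking:", initial_loss)
print("loss after masking: ", final_loss)
print("learned mask:       ", mask)
```

The search only accepts flips that lower the loss, so the masked unit is never worse than the unmasked one, and all of the improvement comes from the optimization over the mask rather than from anything the frozen weights were already doing.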
I don’t think I agree that this is what the “lottery ticket” intuition needs, because of the many-to-many relationship between neurons and subcircuits. Or, like, I think the standard of ‘reliability’ for this is very low. I don’t have a great explanation / picture for this intuition, and so probably I should refine the picture to make sure it’s real before leaning on it too much?
To be clear, I think I agree with your refinement as a more detailed picture of what’s going on; I guess I just think you’re overselling how wrong the naive version is?
Plausible.
Here’s an intuition pump to consider: suppose our net is a complete multigraph. Not only is there an edge between every pair of nodes, there are multiple edges with base-2-exponentially-spaced weights, so we can always pick out a subset of them to get any total weight we please between the two nodes. Clearly, masking can turn this into any circuit we please (with the same number of nodes). But it seems wrong to say that this initial circuit has anything useful in it at all.
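The multigraph picture can be sketched in a few lines. Assume, purely for illustration, that each pair of nodes is joined by parallel edges with weights 2^3, 2^2, ..., 2^-8 (the range and the helper names `choose_edges` and `total_weight` are hypothetical). Choosing which edges to keep is then just writing the desired weight in binary, so a mask can hit any non-negative target in range to within the smallest edge weight; negative weights would need a mirrored bank of negative edges.

```python
def choose_edges(target, exponents):
    """Return a 0/1 mask over parallel edges with weights 2**k, largest first.

    Greedy binary expansion; assumes 0 <= target < 2 * 2**max(exponents).
    """
    mask = []
    remaining = target
    for k in sorted(exponents, reverse=True):
        w = 2.0 ** k
        keep = remaining >= w
        mask.append(1 if keep else 0)
        if keep:
            remaining -= w
    return mask

def total_weight(mask, exponents):
    # Sum of the weights of the edges the mask keeps.
    ws = [2.0 ** k for k in sorted(exponents, reverse=True)]
    return sum(m * w for m, w in zip(mask, ws))

EXPONENTS = range(3, -9, -1)   # edge weights 8, 4, ..., 2**-8

for target in (5.625, 0.3):
    mask = choose_edges(target, EXPONENTS)
    print(target, mask, total_weight(mask, EXPONENTS))
```

Nothing about the edge weights encodes the target; all of the information ends up in the mask, which is the sense in which the initial circuit has nothing useful in it.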
That seems right, but also reminds me of the point that you need to randomly initialize your neural nets for gradient descent to work (because otherwise the gradients everywhere are the same). Like, in the randomly initialized net, each edge is going to be part of many subcircuits, both good and bad, and the gradient is basically “what’s your relative contribution to good subcircuits vs. bad subcircuits?”
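The symmetry point can be checked on a toy network. This is a sketch under assumptions of my own choosing (one tanh hidden layer with three units, a single training example, squared-error loss, hand-written gradients); more precisely than “the gradients everywhere are the same”, what it shows is that hidden units that start identical receive identical gradients, so gradient descent can never make them differ.

```python
import math
import random

def hidden_unit_grads(W1, w2, x, y):
    """Gradient of 0.5 * (pred - y)**2 w.r.t. each hidden unit's input weights.

    W1 holds one list of input weights per hidden unit; w2 holds the
    output weight of each hidden unit. The hidden activation is tanh.
    """
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    err = sum(v * hj for v, hj in zip(w2, h)) - y
    return [[err * v * (1.0 - hj * hj) * xi for xi in x] for v, hj in zip(w2, h)]

x, y = [1.0, -2.0], 1.0   # one hypothetical training example

# Constant initialization: every hidden unit starts out identical.
const_grads = hidden_unit_grads([[0.5, 0.5]] * 3, [0.5] * 3, x, y)

# Random initialization: each hidden unit starts out different.
rng = random.Random(0)
W1 = [[rng.gauss(0.0, 1.0) for _ in x] for _ in range(3)]
w2 = [rng.gauss(0.0, 1.0) for _ in range(3)]
rand_grads = hidden_unit_grads(W1, w2, x, y)

print("constant init:", const_grads)
print("random init:  ", rand_grads)
```

With the constant initialization the three rows of gradients are equal (and nonzero), so the three units stay copies of each other after every update; with the random initialization each unit gets its own gradient and can specialize.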