In hindsight, I probably should have explained this more carefully. “Today’s neural networks already contain dog-recognizing subcircuits at initialization” was not an accurate summary of exactly what I think is implausible.
Here’s a more careful version of the claim:
I do not find it plausible that a random network contains a neuron which acts as a reliable dog-detector. This is the sense in which it’s not plausible that networks contain dog-recognizing subcircuits at initialization. But this is what would be necessary for the “lottery ticket” intuition (i.e. training just picks out some pre-existing useful functionality) to work.
The original lottery ticket paper (and subsequent work) found subcircuits which detect dogs (or something comparable) after disconnecting them from the rest of the network (i.e. after pruning). The “lottery ticket” story is not actually a good intuition for what’s going on there; the pruning step itself does a whole bunch of optimization work, and “it’s just upweighting a pre-existing function” is not an accurate intuition for that optimization.
(The paper on randomized filters is orthogonal to this—it sounds like they’re finding that simple features like edge detection do show up in randomly initialized nets, which is totally plausible. But they still need optimization for the deeper/higher-level parts of the net; they’re just using random initialization for the first few layers, assuming I’m understanding it correctly.)
If there were a robust dog-detection neuron at initialization, and SGD just learned to use that neuron’s output, then the “lottery ticket” intuition would be a very good description of what’s going on. But pruning is a very different operation, one which fundamentally changes what a circuit is computing. The first paper you link uses the phrase “masking is training”, which seems like a good way to think about it. A masking operation isn’t just “picking a winning lottery ticket”, it’s changing the functional behavior of the nodes.
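To make the “masking is training” point concrete, here’s a minimal sketch (my own toy construction, not the setup from the paper): freeze a random feature layer and random readout weights, and let gradient descent optimize only a soft mask over the readout. All the optimization pressure lands on the mask parameters, so the masked circuit ends up computing a function the unmasked random circuit did not.

```python
import torch

# Toy sketch of "masking is training" (hypothetical construction, not from the
# cited papers): all weights are frozen at random initialization; the ONLY
# trainable parameters are mask logits over the readout weights W.
torch.manual_seed(0)
A = torch.randn(1, 32)                 # frozen random first-layer weights
W = torch.randn(32, 1)                 # frozen random readout weights
mask_logits = torch.zeros(32, 1, requires_grad=True)

x = torch.linspace(-1, 1, 64).unsqueeze(1)
target = torch.sin(3 * x)
features = torch.relu(x @ A)           # fixed random features

opt = torch.optim.Adam([mask_logits], lr=0.1)
for step in range(500):
    mask = torch.sigmoid(mask_logits)  # soft relaxation of a binary mask
    pred = features @ (W * mask)       # masked readout
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        print(step, loss.item())       # loss falls even though W never changes
```

The point of the sketch: nothing about W was “selected”; the mask itself is doing the function-fitting, which is why pruning/masking counts as optimization rather than ticket-picking.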
But this is what would be necessary for the “lottery ticket” intuition (i.e. training just picks out some pre-existing useful functionality) to work.
I don’t think I agree, because of the many-to-many relationship between neurons and subcircuits. Or, like, I think the standard of ‘reliability’ for this is very low. I don’t have a great explanation / picture for this intuition, and so probably I should refine the picture to make sure it’s real before leaning on it too much?
To be clear, I think I agree with your refinement as a more detailed picture of what’s going on; I guess I just think you’re overselling how wrong the naive version is?
Plausible.

Here’s an intuition pump to consider: suppose our net is a complete multigraph: not only is there an edge between every pair of nodes, there are multiple edges with base-2-exponentially-spaced weights, so we can always pick out a subset of them to get any total weight we please between the two nodes. Clearly, masking can turn this into any circuit we please (with the same number of nodes). But it seems wrong to say that this initial circuit has anything useful in it at all.
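To spell out how much freedom masking has in that setup (a toy calculation with hypothetical numbers): choosing which parallel edges to keep is literally writing the target weight in binary.

```python
# Between two nodes, put parallel edges with weights 2**0, 2**-1, ..., 2**-15.
# Greedily keeping edges is just the binary expansion of the target, so
# masking alone can set the effective edge weight to anything in [0, 2).
def mask_for_target(target, n_bits=16):
    weights = [2.0 ** -k for k in range(n_bits)]
    mask, remaining = [], target
    for w in weights:               # greedy choice = binary expansion
        keep = remaining >= w
        mask.append(keep)
        if keep:
            remaining -= w
    return weights, mask

weights, mask = mask_for_target(1.3125)
print(sum(w for w, m in zip(weights, mask) if m))  # 1.3125, exactly as requested
```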
That seems right, but also reminds me of the point that you need to randomly initialize your neural nets for gradient descent to work (because otherwise the gradients everywhere are the same). Like, in the randomly initialized net, each edge is going to be part of many subcircuits, both good and bad, and the gradient is basically “what’s your relative contribution to good subcircuits vs. bad subcircuits?”
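A quick way to see the symmetry point (a toy sketch with a small PyTorch model, my own example): under constant initialization, every hidden unit computes the same function, so every unit gets the identical gradient and SGD has no way to differentiate them into distinct subcircuits.

```python
import torch
import torch.nn as nn

# If every weight in a layer starts at the same constant, all hidden units
# are clones: same activations, hence identical gradients, forever.
torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
for p in net.parameters():
    nn.init.constant_(p, 0.1)  # constant (non-random) initialization

x, y = torch.randn(16, 4), torch.randn(16, 1)
loss = ((net(x) - y) ** 2).mean()
loss.backward()

grad = net[0].weight.grad
print(torch.allclose(grad, grad[0].expand_as(grad)))  # True: every row identical
```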