I was in a discussion yesterday that made it seem pretty plausible that you’re wrong—this paper suggests that the over-parameterization needed to ensure that some circuit is (approximately) present at the beginning is not that large.
function space is superexponentially large, circuit space is smaller but still superexponential, so no neural network is ever going to be large enough to have neurons which light up to match most functions/circuits.
I haven’t actually read the paper I’m referencing, but my understanding is that this argument doesn’t work out because the number of possible circuits of size N is balanced by the high number of subgraphs in a graph of size M (where M is only logarithmically larger than N).
That being said, I don’t know whether “present at the beginning” is the same as “easily found by gradient descent”.
See my comment here. The paper you link (like others I’ve seen in this vein) requires pruning, which will change the functional behavior of the nodes themselves. As I currently understand it, it is consistent with not having any neuron which lights up to match most functions/circuits.
I agree with your comment that the claim “I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization” is ambiguous—“contains” can mean “under pruning”.
Obviously, this is an important distinction: if the would-be dog-recognizing circuit behaves extremely differently due to intersection with a lot of other circuits, it could be much harder to find. But why is “a single neuron lighting up” where you draw the line?
It seems clear that at least some relaxation of that requirement is tenable. For example, if no one neuron lights up in the correct pattern, but there’s a linear combination of neurons (before the output layer) which does, then it seems we’re good to go: GD could find that pretty easily.
I guess this is where the tangent space model comes in; if in practice (for large networks) we stay in the tangent space, then a linear combination of neurons is basically exactly as much as we can relax your requirement.
But without the tangent-space hypothesis, it’s unclear where to draw the line, and your claim that an existing neuron already behaving in the desired way is “what would be necessary for the lottery ticket intuition” isn’t clear to me. (Is there a more obvious argument for this, according to you?)
Yeah, I agree that something more general than one neuron but less general than (or at least different from) pruning might be appropriate. I’m not particularly worried about where that line “should” be drawn a priori, because the tangent space indeed seems like the right place to draw the line empirically.
The tangent-space hypothesis implies something close to “gd finds a solution if and only if there’s already a dog detecting neuron” (for large networks, that is) -- specifically it seems to imply something pretty close to “there’s already a feature”, where “feature” means a linear combination of existing neurons within a single layer
gd in fact trains NNs to recognize dogs
Therefore, we’re still in the territory of “there’s already a dog detector”
The tangent-space hypothesis implies something close to this but not quite—instead of ‘dog-detecting neuron’, it’s ‘parameter such that the partial derivative of the output with respect to that parameter, as a function of the input, implements a dog-detector’. This would include (the partial derivative w.r.t.) neurons via their bias.
I was in a discussion yesterday that made it seem pretty plausible that you’re wrong—this paper suggests that the over-parameterization needed to ensure that some circuit is (approximately) present at the beginning is not that large.
I haven’t actually read the paper I’m referencing, but my understanding is that this argument doesn’t work out because the number of possible circuits of size N is balanced by the high number of subgraphs in a graph of size M (where M is only logarithmically larger than N).
That being said, I don’t know whether “present at the beginning” is the same as “easily found by gradient descent”.
See my comment here. The paper you link (like others I’ve seen in this vein) requires pruning, which will change the functional behavior of the nodes themselves. As I currently understand it, it is consistent with not having any neuron which lights up to match most functions/circuits.
Ah, I should have read comments more carefully.
I agree with your comment that the claim “I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization” is ambiguous—“contains” can mean “under pruning”.
Obviously, this is an important distinction: if the would-be dog-recognizing circuit behaves extremely differently due to intersection with a lot of other circuits, it could be much harder to find. But why is “a single neuron lighting up” where you draw the line?
It seems clear that at least some relaxation of that requirement is tenable. For example, if no one neuron lights up in the correct pattern, but there’s a linear combination of neurons (before the output layer) which does, then it seems we’re good to go: GD could find that pretty easily.
I guess this is where the tangent space model comes in; if in practice (for large networks) we stay in the tangent space, then a linear combination of neurons is basically exactly as much as we can relax your requirement.
But without the tangent-space hypothesis, it’s unclear where to draw the line, and your claim that an existing neuron already behaving in the desired way is “what would be necessary for the lottery ticket intuition” isn’t clear to me. (Is there a more obvious argument for this, according to you?)
Yeah, I agree that something more general than one neuron but less general than (or at least different from) pruning might be appropriate. I’m not particularly worried about where that line “should” be drawn a priori, because the tangent space indeed seems like the right place to draw the line empirically.
Wait… so:
The tangent-space hypothesis implies something close to “gd finds a solution if and only if there’s already a dog detecting neuron” (for large networks, that is) -- specifically it seems to imply something pretty close to “there’s already a feature”, where “feature” means a linear combination of existing neurons within a single layer
gd in fact trains NNs to recognize dogs
Therefore, we’re still in the territory of “there’s already a dog detector”
...yeah?
The tangent-space hypothesis implies something close to this but not quite—instead of ‘dog-detecting neuron’, it’s ‘parameter such that the partial derivative of the output with respect to that parameter, as a function of the input, implements a dog-detector’. This would include (the partial derivative w.r.t.) neurons via their bias.
Not quite. The linear expansion isn’t just over the parameters associated with one layer, it’s over all the parameters in the whole net.