1. The tangent-space hypothesis implies something close to “GD finds a solution if and only if there’s already a dog-detecting neuron” (for large networks, that is); more specifically, it seems to imply something close to “there’s already a feature”, where “feature” means a linear combination of existing neurons within a single layer.
2. GD in fact trains NNs to recognize dogs.
3. Therefore, we’re still in “there’s already a dog detector” territory.
The tangent-space hypothesis implies something close to this, but not quite: instead of ‘dog-detecting neuron’, it’s ‘parameter such that the partial derivative of the output with respect to that parameter, viewed as a function of the input, implements a dog-detector’. This would include neurons, via the partial derivative with respect to their bias.
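To make the ‘partial derivative as detector’ reading concrete, here is the first-order expansion the tangent-space picture rests on (a sketch; the symbols $f$, $\theta_0$, $b_i$, $z_i$ are my notation, not anything fixed in the exchange). Linearizing the network output around the initialization $\theta_0$,

$$f(x;\theta) \;\approx\; f(x;\theta_0) + \sum_j \frac{\partial f}{\partial \theta_j}(x;\theta_0)\,\big(\theta_j - \theta_{0,j}\big),$$

training in the tangent space can only reweight the fixed functions $x \mapsto \frac{\partial f}{\partial \theta_j}(x;\theta_0)$. And since a hidden neuron’s bias $b_i$ enters its preactivation $z_i$ additively, $\frac{\partial f}{\partial b_i}(x) = \frac{\partial f}{\partial z_i}(x;\theta_0)$, which is the sense in which neurons themselves show up among these tangent features.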
Wait… so:
1. The tangent-space hypothesis implies something close to “GD finds a solution if and only if there’s already a dog-detecting neuron” (for large networks, that is); more specifically, it seems to imply something close to “there’s already a feature”, where “feature” means a linear combination of existing neurons within a single layer.
2. GD in fact trains NNs to recognize dogs.
3. Therefore, we’re still in “there’s already a dog detector” territory.
...yeah?
Not quite. The linear expansion isn’t just over the parameters associated with a single layer; it’s over all the parameters in the whole net.
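To see that operationally: in the linearized model, the displacement the linear term acts on has the same shape as the full parameter vector, every layer’s weights and biases at once. A minimal JAX sketch (the tiny MLP and all names in it are my own illustrative assumptions, not anything from the exchange above):

```python
import jax
import jax.numpy as jnp

def mlp(params, x):
    # A toy two-layer MLP standing in for "the network".
    W1, b1, W2, b2 = params
    h = jnp.tanh(W1 @ x + b1)
    return W2 @ h + b2

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
params0 = (
    jax.random.normal(k1, (32, 8)),  # W1
    jnp.zeros(32),                   # b1
    jax.random.normal(k2, (1, 32)),  # W2
    jnp.zeros(1),                    # b2
)
x = jnp.ones(8)

# Linearize f(theta) = mlp(theta, x) around the initialization theta_0.
# f_lin is exactly the first-order term: it maps a parameter displacement
# d_theta to sum_j (df/dtheta_j)(x; theta_0) * d_theta_j.
f0, f_lin = jax.linearize(lambda p: mlp(p, x), params0)

# The displacement ranges over *all* parameters of the whole net, not the
# parameters of one layer: d_theta mirrors the full pytree of params0.
d_theta = jax.tree_util.tree_map(jnp.ones_like, params0)
print(f0 + f_lin(d_theta))  # tangent-space prediction at theta_0 + d_theta
```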