It looks like you’re not disputing the maths, but the legitimacy/meaningfulness of the simplified models of natural selection that I used?
I’m disputing both. Re: math, the noise in your model isn’t distributed like SGD noise, and unlike SGD the step size depends on the gradient norm. (I know you did mention the latter issue, but IMO it rules out calling this an “equivalence.”)
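Concretely, the baseline I have in mind is the standard SGD update (this part is textbook; the contrast with your model is the claim above):

\[
\theta_{t+1} = \theta_t - \eta\, \hat g_t, \qquad \hat g_t = \nabla L(\theta_t) + \xi_t, \quad \mathbb{E}[\xi_t] = 0,
\]

where the minibatch noise \(\xi_t\) enters additively and the step is linear in \(\hat g_t\) with a fixed step size \(\eta\). My complaint is that your model’s update matches neither property.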
I did see your second proposal, but it was a mostly-verbal sketch that I found hard to follow, and which I don’t feel like I can trust without seeing a mathematical presentation.
(FWIW, if we have a population that’s “spread out” over some region of a high-dim NN loss landscape—even if it’s initially a small / infinitesimal region—I expect it to quickly split up into lots of disjoint “tendrils,” something like dye spreading in water. Consider what happens e.g. at saddle points. So the population will rapidly “speciate” and look like an ensemble of GD trajectories instead of just one.
If your model assumes by fiat that this can’t happen, I don’t think it’s relevant to training NNs with SGD.)
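(To make the saddle-point picture concrete, here’s a toy simulation, entirely my own construction rather than anything from your model: a tight cluster of points descending the saddle f(x, y) = x² − y² under noisy gradient steps quickly splits into two diverging groups according to the sign of each point’s y-coordinate.

```python
import numpy as np

# Toy illustration (my construction): a near-identical population
# descending the saddle f(x, y) = x^2 - y^2 splits into two diverging
# groups, depending on the sign of each point's initial y-coordinate.

rng = np.random.default_rng(0)
pop = rng.normal(scale=1e-3, size=(1000, 2))  # tight cluster at the saddle
lr = 0.1

for _ in range(100):
    grad = np.stack([2 * pop[:, 0], -2 * pop[:, 1]], axis=1)  # grad of x^2 - y^2
    pop -= lr * grad + rng.normal(scale=1e-3, size=pop.shape)  # noisy GD step

# The unstable y-direction amplifies tiny initial differences:
frac_up = (pop[:, 1] > 0).mean()
print(f"fraction escaping upward: {frac_up:.2f}, mean |y|: {np.abs(pop[:, 1]).mean():.3g}")
```

The unstable direction multiplies each point’s y-coordinate every step, so infinitesimal initial spread becomes macroscopic divergence: the “dye in water” picture.)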
Wait, you think that a model which doesn’t speciate isn’t relevant to SGD? I’ll need help following that, unless you meant something else: speciation seems like one of the places where natural evolution distinguishes itself from gradient descent, and you seem to be making that very point yourself?
In the second model, we retrieve non-speciation by allowing for crossover/horizontal transfer, and yes, I rule out speciation essentially by fiat (as a consequence of the ‘eventually-universal mixing’ assumption). In real natural selection, even with horizontal transfer, you get speciation, albeit rarely. It’s obviously a fascinating topic, but one I think is pretty irrelevant to this analogy.
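To illustrate what the mixing assumption buys, here’s a toy sketch (my construction for this comment, not the actual second model): rerunning your saddle example but with blending crossover each generation, the population’s spread stays bounded, and the cluster escapes the saddle as one coherent trajectory rather than splitting into tendrils.

```python
import numpy as np

# Toy version of 'eventually-universal mixing' (a sketch, not the actual
# second model): blending crossover each generation keeps the population
# from splitting at the saddle of f(x, y) = x^2 - y^2.

rng = np.random.default_rng(1)
pop = rng.normal(scale=1e-3, size=(1000, 2))  # tight cluster at the saddle
lr = 0.1

for _ in range(100):
    grad = np.stack([2 * pop[:, 0], -2 * pop[:, 1]], axis=1)
    pop = pop - lr * grad + rng.normal(scale=1e-3, size=pop.shape)
    # universal mixing: each offspring is the average of two random
    # parents, which roughly halves the population variance per generation
    parents_a = pop[rng.integers(0, len(pop), size=len(pop))]
    parents_b = pop[rng.integers(0, len(pop), size=len(pop))]
    pop = 0.5 * (parents_a + parents_b)

# The spread stays tiny while the mean escapes the saddle as one unit,
# i.e. the population behaves like a single noisy-GD trajectory.
print("y spread with mixing:", pop[:, 1].std())
print("population mean y:", pop[:, 1].mean())
```

Mixing contracts the within-population variance faster than the saddle’s unstable direction can expand it, so the ensemble-of-trajectories picture never gets off the ground.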
For me, the step-size thing is interesting but essentially a minor detail. Any number of practical departures from pure SGD mess with the step size anyway (and with the gradient!), so this feels like asking for too much. Do we really think SGD vs momentum vs Adam vs … is relevant to the conclusions we want to draw? (Serious question; my best guess is ‘no’, but I hold that medium-lightly.)
(A nitpick rendered irrelevant by my preceding paragraph, but) FWIW, vanilla SGD’s update does depend on the gradient norm. [ETA: I think I misunderstood exactly what you meant by ‘step size depends on the gradient norm’, so I think we agree about the facts of SGD. But now consider the space of optimizers including SGD, RMSProp, etc. The ‘depends on gradient norm’ piece which arises from my evolution model seems entirely at home in that family; see the sketch below.]
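To spell out the family resemblance (the SGD and RMSProp rules are standard; the ‘evolution-flavoured’ stand-in is my assumption about its shape, for illustration only): vanilla SGD’s step already scales linearly with the gradient, RMSProp divides that scale back out per-coordinate, and a rule with one extra factor of ‖g‖ just sits elsewhere in the same family.

```python
import numpy as np

def sgd_step(theta, g, lr=1e-2):
    # vanilla SGD: step magnitude is proportional to ||g||
    return theta - lr * g

def rmsprop_step(theta, g, v, lr=1e-2, rho=0.9, eps=1e-8):
    # RMSProp: a running RMS of past gradients divides the gradient scale
    # back out, so the per-coordinate step size is roughly norm-independent
    v = rho * v + (1 - rho) * g ** 2
    return theta - lr * g / (np.sqrt(v) + eps), v

def norm_scaled_step(theta, g, lr=1e-2):
    # hypothetical stand-in for the evolution-derived rule: one extra
    # factor of ||g|| on top of SGD's linear dependence (my assumption)
    return theta - lr * np.linalg.norm(g) * g
```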
On the distribution of noise, I’ll happily acknowledge that I didn’t show equivalence. I half expect that one could be eked out at a stretch, but I also think this is another minor and unimportant detail.