Note: I just watched the videos. I personally would not recommend the first video as an explanation to a layperson if I wanted them to come away with accurate intuitions around how today’s neural networks learn / how we optimize them. What it describes is a very different kind of optimizer, one explicitly patterned after natural selection, such as a genetic algorithm or population-based training, and the follow-up video more or less admits this. I would personally recommend they opt for these videos instead:
3Blue1Brown—Gradient descent, how neural networks learn
Emergent Garden—Watching Neural Networks Learn
WIRED—Computer Scientist Explains Machine Learning in 5 Levels of Difficulty
Except that selection and gradient descent are closely mathematically related—you have to make a bunch of simplifying assumptions, but ‘mutate and select’ (evolution) is actually equivalent to ‘make a small approximate gradient step’ (SGD) in the limit of small steps.
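Here’s a minimal toy sketch of that claim (the fitness function, dimensions, and selection rule are my own choices for illustration, not taken from the post): at a fixed point, average the mutations that survive a ‘keep improvements’ selection rule, and the result lines up with the true gradient direction as the mutation scale shrinks.

```python
# Toy check: 'mutate and select' recovers the gradient direction in the
# small-step limit. All specifics (fitness function, sigma) are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def fitness(x):
    # Arbitrary smooth toy fitness (higher is better).
    return -np.sum(x ** 2) + np.sin(x[0])

def grad_fitness(x):
    g = -2 * x
    g[0] += np.cos(x[0])
    return g

x = rng.normal(size=10)       # current 'genome'
sigma = 1e-3                  # small mutation scale ('limit of small steps')
mutations = sigma * rng.normal(size=(100_000, x.size))

# Selection: keep exactly the mutations that increase fitness.
base = fitness(x)
kept = np.array([m for m in mutations if fitness(x + m) > base])
mean_step = kept.mean(axis=0)

g = grad_fitness(x)
cos = mean_step @ g / (np.linalg.norm(mean_step) * np.linalg.norm(g))
print(f"cosine(mean selected mutation, gradient) = {cos:.4f}")  # ~ 1.0
```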
I read the post and left my thoughts in a comment. In short, I don’t think the claimed equivalence in the post is very meaningful.
(Which is not to say the two processes have no relationship whatsoever. But I am skeptical that it’s possible to draw a connection stronger than “they both do local optimization and involve randomness.”)
Awesome, I saw that comment—thanks, and I’ll try to reply to it in more detail.
It looks like you’re not disputing the maths, but the legitimacy/meaningfulness of the simplified models of natural selection that I used? From a skim, the caveats you raised are mostly/all acknowledged in the original post too—though I think you may have missed the (less rigorous but more realistic!) second model at the end, which departs from the simple annealing process to a more involved population process.
I think even on this basis, though, it’s going too far to claim that the best we can say is “they both do local optimization and involve randomness”! The steps are systematically pointed up/down the local fitness gradient, for one. And they arise from a sample-based stochastic realisation, for another.
I don’t want you to get the impression I’m asking for too much from this analogy. But the analogy is undeniably there. In fact, in those explainer videos Habryka linked, the particular evolution described is a near-match for my first model (which, yes, departs from natural genetic evolution in the same ways).
It looks like you’re not disputing the maths, but the legitimacy/meaningfulness of the simplified models of natural selection that I used?
I’m disputing both. Re: math, the noise in your model isn’t distributed like SGD noise, and unlike SGD, the step size depends on the gradient norm. (I know you did mention the latter issue, but IMO it rules out calling this an “equivalence.”)
I did see your second proposal, but it was a mostly-verbal sketch that I found hard to follow, and which I don’t feel like I can trust without seeing a mathematical presentation.
(FWIW, if we have a population that’s “spread out” over some region of a high-dim NN loss landscape—even if it’s initially a small / infinitesimal region—I expect it to quickly split up into lots of disjoint “tendrils,” something like dye spreading in water. Consider what happens e.g. at saddle points. So the population will rapidly “speciate” and look like an ensemble of GD trajectories instead of just one.
If your model assumes by fiat that this can’t happen, I don’t think it’s relevant to training NNs with SGD.)
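(A quick toy simulation of that ‘tendrils’ picture, with the landscape and constants invented purely for illustration: a tight population doing noisy descent on the saddle x² − y² splits into two diverging groups along the unstable direction.)

```python
# Toy saddle-point 'speciation': a tight cluster of points doing noisy
# gradient descent on f(x, y) = x^2 - y^2 splits into two tendrils.
import numpy as np

rng = np.random.default_rng(0)
pop = rng.normal(scale=1e-3, size=(1000, 2))   # tight cluster near the saddle at the origin
lr, noise = 0.1, 1e-4

for _ in range(100):
    grad = np.stack([2 * pop[:, 0], -2 * pop[:, 1]], axis=1)  # gradient of x^2 - y^2
    pop = pop - lr * grad + noise * rng.normal(size=pop.shape)

print("fraction with y > 0:", (pop[:, 1] > 0).mean())            # ~0.5: two groups
print("spread along y vs x:", pop[:, 1].std(), pop[:, 0].std())  # y huge, x tiny
```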
Wait, you think that a model which doesn’t speciate isn’t relevant to SGD? I’ll need help following, unless you meant something else. It seems like speciation is one of the places where natural evolution distinguishes itself from gradient descent, but you seem to also be making this point?
In the second model, we recover non-speciation by allowing for crossover/horizontal transfer, and yes, essentially by fiat I rule out speciation (as a consequence of the ‘eventually-universal mixing’ assumption). In real natural selection, even with horizontal transfer, you get speciation, albeit rarely. It’s obviously a fascinating topic, but I think pretty irrelevant to this analogy.
For me, the step-size thing is interesting but essentially a minor detail. Any number of practical departures from pure SGD mess with the step size anyway (and with the gradient!), so this feels like asking for too much. Do we really think SGD vs momentum vs Adam vs … is relevant to the conclusions we want to draw? (Serious question; my best guess is ‘no’, but I hold that medium-lightly.)
(irrelevant nitpick given my preceding paragraph, but) FWIW vanilla SGD does depend on gradient norm. [ETA: I think I misunderstood exactly what you were saying by ‘step size depends on the gradient norm’, so I think we agree about the facts of SGD. But now think about the space including SGD, RMSProp, etc. The ‘depends on gradient norm’ piece which arises from my evolution model seems entirely at home in that family.]
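To make that concrete, here’s a rough sketch (standard textbook update rules, my own illustration) of how members of that optimizer family treat the gradient norm differently:

```python
# How the SGD family treats gradient norm: vanilla SGD scales with it,
# RMSProp roughly normalises it away per-coordinate, signSGD ignores it.
import numpy as np

def sgd_step(w, g, lr=0.1):
    # Vanilla SGD: step length scales linearly with the gradient norm.
    return w - lr * g

def rmsprop_step(w, g, v, lr=0.1, beta=0.9, eps=1e-8):
    # RMSProp: divide by a running RMS of past gradients, so per-coordinate
    # step sizes become roughly insensitive to the overall gradient scale.
    v = beta * v + (1 - beta) * g ** 2
    return w - lr * g / (np.sqrt(v) + eps), v

def sign_sgd_step(w, g, lr=0.1):
    # signSGD: discards the norm entirely and keeps only the direction.
    return w - lr * np.sign(g)

w = np.zeros(3)
for scale in (0.1, 10.0):
    g = scale * np.ones(3)
    print(scale,
          np.linalg.norm(w - sgd_step(w, g)),       # grows with the norm
          np.linalg.norm(w - sign_sgd_step(w, g)))  # constant
```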
On the distribution of noise, I’ll happily acknowledge that I didn’t show equivalence. I half expect that one could be eked out at a stretch, but I also think this is another minor and unimportant detail.
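For what it’s worth, here’s a quick toy look at where the mismatch shows up, under the same simple ‘keep improvements’ selection rule as above (again my construction, not the post’s): conditioned on surviving selection, the step’s component along the gradient is one-sided (half-normal), whereas SGD’s minibatch noise is roughly symmetric around the mean gradient step.

```python
# The selected-mutation noise is one-sided along the gradient direction,
# not symmetric: its along-gradient component is half-normal.
import numpy as np

rng = np.random.default_rng(0)
g = np.array([1.0, 0.0])                 # local gradient direction (unit scale)
steps = rng.normal(size=(200_000, 2))    # candidate mutations
kept = steps[steps @ g > 0]              # selection, in the small-step limit

along = kept @ g                         # component along the gradient
print("mean along gradient:", along.mean())   # ~ sqrt(2/pi) ~ 0.80, not 0
print("min along gradient: ", along.min())    # > 0: one-sided
print("orthogonal mean/std:", kept[:, 1].mean(), kept[:, 1].std())  # ~0, ~1
```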
I agree that they are related. In the context of this discussion, the critical difference between SGD and evolution is somewhat captured by your Assumption 1:
Fixed ‘fitness function’ or objective function mapping genome to continuous ‘fitness score’
Evolution does not directly select/optimize the content of minds. Evolution selects/optimizes genomes based (in part) on how they distally shape what minds learn and what minds do (to the extent that impacts reproduction), with even more indirection caused by selection’s heavy dependence on the environment. All of that creates a ton of optimization “slack”, such that large-brained human minds with language could steer optimization far faster & more decisively than natural selection could. This is what 1a3orn was pointing to earlier with:
evolution does not grow minds, it grows hyperparameters for minds. When you look at the actual process for how we actually start to like ice-cream—namely, we eat it, and then we get a reward, and that’s why we like it—then the world looks a lot less hostile, and misalignment a lot less likely.
SGD does not have that slack by default. It acts directly on cognitive content (associations, reflexes, decision-weights), without slack or added indirection. If you control the training dataset/environment, you control what is rewarded and what is penalized, and if you are using SGD, then this lets you directly mold the circuits in the model’s “brain” as desired. That is one of the main alignment-relevant intuitions that gets lost when blurring the evolution/SGD distinction.
Right. And in the context of these explainer videos, the particular evolution described has the properties which make it near-equivalent to SGD, I’d say?
SGD does not have that slack by default. It acts directly on cognitive content (associations, reflexes, decision-weights), without slack or added indirection. If you control the training dataset/environment, you control what is rewarded and what is penalized, and if you are using SGD, then this lets you directly mold the circuits in the model’s “brain” as desired.
Hmmm, this strikes me as much too strong (especially ‘this lets you directly mold the circuits’).
Remember also that with RLHF, we’re learning a reward model which is something like the more-hardcoded bits of brain-stuff, which is in turn providing updates to the actually-acting artefact, which is something like the more-flexibly-learned bits of brain-stuff.
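Schematically, something like this toy loop (every name and number here is invented for illustration; this is not any real RLHF implementation): a fixed learned scorer supplies the update signal, and the acting policy is shaped by that score rather than directly by the environment.

```python
# Toy REINFORCE-style loop: a learned reward model (the 'more-hardcoded'
# component) scores behaviour and thereby drives updates to the acting
# policy (the 'more-flexibly-learned' component). Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
reward_weights = rng.normal(size=4)   # stand-in for a trained reward model
policy = rng.normal(size=4)           # stand-in for the acting policy's weights

def reward_model(behaviour):
    return reward_weights @ behaviour

print("initial reward:", reward_model(policy))
baseline = 0.0
for _ in range(1000):
    behaviour = policy + 0.1 * rng.normal(size=4)              # act, with exploration noise
    score = reward_model(behaviour)                            # scored by the reward model
    policy += 1.0 * (score - baseline) * (behaviour - policy)  # reinforce scored behaviour
    baseline += 0.1 * (score - baseline)                       # running baseline (variance reduction)
print("final reward:  ", reward_model(policy))                 # should come out clearly higher
```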
I also think there’s a fair alternative analogy to be drawn like
evolution of genome (including mostly-hard-coded brain-stuff) ~ SGD (perhaps +PBT) of NN weights
within-lifetime-learning of organism ~ in-context something-something of NN
(this is one analogy I commonly drew before RLHF came along.)
So, look, the analogies are loose, but they aren’t baseless.