Origin and summary
This post arose from a feeling in a few conversations that I wasn’t being crisp enough or epistemically virtuous enough when discussing the relationship between gradient-based ML methods and natural selection/mutate-and-select methods. Some people would respond like, ‘yep, seems good’, while others were far less willing to entertain analogies there. Clearly there was some logical uncertainty and room for learning, so I decided to ‘math it out’ and ended up clarifying a few details about the relationship, while leaving a lot unresolved. Evidently for some readers this is still not crisp or epistemically virtuous enough!
I still endorse this post as the neatest explanation I’m aware of relating gradient descent to natural selection under certain approximations. I take the proof of the limiting approximation to be basically uncontroversial[1], but I think my discussion of simplifying assumptions (and how to move past them) is actually the most valuable part of the post.
Overall I introduced three models of natural selection:
an annealing-style degenerate natural selection, which is most obviously equivalent in the limit to a gradient step
a one-mutant-population-at-a-time model (with fixation or extinction before another mutation arises)
(in the discussion) a multi-mutations-in-flight model with horizontal transfer (which is most similar to real natural selection)
All three are (in the limit of small mutations) performing gradient steps. The third one took a bit more creativity to invent, and is probably where I derived the most insight from this work.
All three are still far from ‘actual real biological natural selection’!
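To make the limiting claim concrete, here is a minimal numerical sketch (not from the post; the toy quadratic fitness and the clipped-advantage stand-in for fixation probability are my own illustrative assumptions): the expected update of a crude mutate-and-select rule aligns with the fitness gradient when mutations are small.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(x):
    # Toy smooth fitness landscape (an assumption for illustration only).
    return -np.sum(x ** 2, axis=-1)

def expected_selection_step(x, sigma=0.01, trials=200_000):
    """Average update of a crude mutate-and-select rule.

    Each trial proposes a small mutation d ~ N(0, sigma^2 I); the mutant
    replaces the incumbent with probability proportional to its fitness
    advantage (clipped at zero), a rough stand-in for fixation probability.
    """
    d = rng.normal(0.0, sigma, size=(trials, x.size))
    advantage = fitness(x + d) - fitness(x)
    p_fix = np.clip(advantage, 0.0, None)   # only favourable mutants can fix
    p_fix /= p_fix.max()                    # normalise into [0, 1]
    accepted = rng.random(trials) < p_fix
    return d[accepted].sum(axis=0) / trials  # expected step per trial

x = np.array([1.0, -0.5])
step = expected_selection_step(x)
grad = -2.0 * x                              # exact gradient of fitness at x
cosine = step @ grad / (np.linalg.norm(step) * np.linalg.norm(grad))
print(round(cosine, 3))                      # close to 1: step aligns with gradient
```

Shrinking sigma tightens the alignment further, which is the sense of ‘in the limit of small mutations’ above.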
What important features of natural selection are missing?
Speciation!
This is a very distinguishing and interesting feature of real biological natural selection
All my models (one way or another) rule out speciation, which is called out but could be clearer in the post
A model with gradients over ‘mixed strategies’ might be able to incorporate speciation
Variability of the fitness landscape
Fitness that depends on the population distribution is not covered by my models at all
I called out Population-Based Training as the most obvious ML analogue to this
I think this is still basically right, though perhaps more broadly construed than the specific mechanism described in the PBT paper
Any within-lifetime things!
Within-lifetime learning
Sexual selection and offspring preference
Cultural accumulation
Epigenetics
Recombination hacks
Anything with intragenomic conflict
Transposons, segregation distortion, etc.
So what?
Another point I didn’t touch on in this post itself is what to make of any of this.
Understanding ML
For predicting the nature of ML artefacts, I don’t think speciation is relevant, so that’s a point in favour of these models. I do think population-dependent dynamics (effectively self-play) are potentially very relevant, depending on the setting, which is why in the post I said,
As such it may be appropriate to think of real natural selection as performing something locally equivalent to SGD but globally more like self-play PBT.
One main conclusion people want to point to when making this kind of analogy is that selecting for thing-X-achievers doesn’t necessarily produce thing-X-wanters, i.e. goal-misgeneralisation aka mesa-optimisation aka optimisation-daemons. I guess tightening up the maths sort of shores up this kind of conclusion?[2]
Thomas Kwa has a nice brief list of other retrodictions of the analogy between gradient-based ML and natural selection.
How much can we predict? When attempting to draw more specific conclusions (e.g. about particular inductive biases or generalisation), I think in practice analogies to natural selection are going to be screened off by specific evidence quite easily. But it’s not clear that we can easily get that more specific evidence in advance, and for more generally-applicable but less-specific claims, I think natural selection gives us one good prior to reason from.
If we’re trying to draw conclusions about intelligent systems, we should make sure to note that a lot of impressive intelligence-bearing artefacts in nature (brains etc.) are grown and developed within-lifetime! This makes the object of natural selection (genomes, mostly) something like hyperparameters or reward models or learning schedules or curricula rather than like fully-fledged cognitive algorithms.
In a recent exchange, 1a3orn shared some interesting resources which make similar connections between brain-like learning systems and gradient-based systems. More commonalities!
Understanding natural selection
Sometimes people want to understand the extent to which natural selection ‘is optimising for’ something (and what the exact moving pieces are). Playing with the maths here and specifying some semantics via the models has helped sharpen my own thinking on this. For example, see my discussion of ‘fitness’ here:
The original pretheoretic term ‘fitness’ meant ‘being fitted/suitable/capable (relative to a context)’, and this is what Darwin and co were originally pointing to. (Remember they didn’t have genes or Mendel until decades later!)
The modern technical usage of ‘fitness’ very often operationalises this, for organisms, to be something like number of offspring, and for alleles/traits to be something like change in prevalence (perhaps averaged and/or normalised relative to some reference).
So natural selection is the ex post tautology ‘that which propagates in fact propagates’.
If we allow for ex ante uncertainty, we can talk about probabilities of selection/fixation and expected time to equilibrium and such. Here, ‘fitness’ is some latent property, understood as a distribution over outcomes.
If we look at longer timescales, ‘fitness’ is heavily bimodal: in many cases a particular allele/trait either fixes or goes extinct[3]. If we squint, we can think of this unknown future outcome as the hidden ground truth of latent fitness, about which some bits are revealed over time and over generations.
How can we reconcile this claim with the fact that the operationalised ‘relative fitness’ often follows an approximately random walk, rather than moving sustainedly upward[4]? Well, it’s precisely because it’s relative: relative to a changing series of fitness landscapes over time. Those landscapes change partly as a consequence of abiotic processes, partly as a consequence of other species’ changes, and often as a consequence of the very trait changes which natural selection is itself imposing within a population/species!
So, I think, we can say with a straight face that natural selection is optimising (weakly) for increased fitness, even while a changing fitness landscape means that almost by definition relative fitness hovers around a constant for most extant lineages. I don’t think it’s optimising on species, but on lineages (which sometimes correspond).[5]
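A quick way to see the bimodal ‘latent fitness’ picture is to simulate a textbook Wright-Fisher model (the model is standard; the particular parameter choices below are my own, for illustration): a moderately advantageous allele still either fixes or goes extinct, and its ex ante ‘fitness’ only fixes a distribution over those two outcomes.

```python
import numpy as np

rng = np.random.default_rng(1)

def wright_fisher(p0=0.05, s=0.02, N=500, generations=5000):
    """Endpoint of an allele-frequency trajectory under selection coefficient s."""
    p = p0
    for _ in range(generations):
        if p in (0.0, 1.0):                  # absorbed: fixed or extinct
            break
        # Selection-weighted frequency, then binomial resampling (drift).
        w = p * (1 + s) / (p * (1 + s) + (1 - p))
        p = rng.binomial(N, w) / N
    return p

outcomes = [wright_fisher() for _ in range(200)]
fixed = sum(o == 1.0 for o in outcomes)
extinct = sum(o == 0.0 for o in outcomes)
# Essentially every run ends at one of the two modes, with both well represented.
print(fixed, extinct)
```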
In further (unpublished) mathsy scribbles around the time of this post, I also played with rederiving variations on the Price equation, and spent some time thinking about probabilities of fixation and time-to-fixation (corroborating some of Eliezer’s old claims). These were good exercises, but not obviously worth the time to write up.
I was also working with Holly Elmore on potential insights from some more specific mechanisms in natural selection regarding intragenomic conflict. I learned a lot (in particular about how ‘computer sciencey’ a lot of biological machinery is!) but didn’t arrive at any generalisable insights. I do expect there might be something in this area though.
Understanding control
The connections here were part of a broader goal of mine to understand ‘deliberation’ and ‘control’. I’ve had a hard time making real progress on this since (in part due to time spent on other things), but I do feel my understanding of these has sharpened usefully. Spending some time closely pondering the connection between different optimisation procedures definitely provided some insights there.
I recently came across the ‘complex systems’ kind of view on adaptation and control and wonder if I might be converging in that direction.
The biggest puzzle-piece I want to see cracked regards the temporal extent of predictors/deliberators. Greater competence seems tightly linked to the ability to do ‘more lookahead’. I think this is one of the keys which gives rise to ‘deliberate exploration’/‘experimentation’, which is one of my top candidate intelligence explosion feedback loops[6]. My incomplete discussion of deliberation was heading in that direction. Some more recent gestures include some disorganised shortform discussion and my planner simulator conjecture:
something like, ‘every (simplest) simulator of a planner contains (something homomorphic to) a planner’.
How far are we stretching to call this ‘equivalence’?
The proofs demonstrate that all three models of natural selection perform a noisy realisation of a gradient step (in the limit of small mutations).
As I called out in the post, I didn’t pay much attention to step size, nor to the particular stochastic distribution of updates. To my mind, this is enough to place the three models of natural selection well within the class of ‘stochastic gradient methods’[7]. ‘SGD’ is often used loosely for this broader class, but using the term without qualification can be a bit misleading, since it also names a more specific stochastic gradient implementation.
nostalgebraist calls me out on this:
the noise in your model isn’t distributed like SGD noise, and unlike SGD the step size depends on the gradient norm.
which is the most attentive criticism I’ve had of this post.
Aren’t we stretching things quite far if we’re including momentum methods and related, with history/memory-sensitive updates? Note that natural selection can implement a kind of momentum too (e.g. via within-lifetime behavioural stuff like migration, offspring preference, and sexual selection)! Neither my models nor the ‘SGD’ they’re equivalent to exhibit this.
Is it all that different? SGD momentum methods are usually analogised literally to ‘heavy balls’ etc. And I suspect, given how some gradient-update-based meta-learning methods work, you can cast within-lifetime updates as somehow involving a momentum.
nostalgebraist’s dissatisfaction notwithstanding, these are good criticisms, but they appear to miss a lot of the caveats already present in the original post.
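For concreteness, the history-sensitivity at issue is just a velocity buffer carried between updates; a minimal sketch of the two update rules (parameter values are illustrative, not from the post):

```python
def sgd_step(x, grad, lr=0.1):
    # Vanilla SGD: move directly along the current (possibly noisy) gradient.
    return x - lr * grad

def momentum_step(x, v, grad, lr=0.1, beta=0.9):
    # Momentum: the update depends on history through the velocity buffer v,
    # so consecutive steps in a consistent direction compound.
    v = beta * v + grad
    return x - lr * v, v

# Minimise f(x) = x^2 (gradient 2x) with both rules.
x_sgd, x_mom, v = 5.0, 5.0, 0.0
for _ in range(300):
    x_sgd = sgd_step(x_sgd, 2 * x_sgd)
    x_mom, v = momentum_step(x_mom, v, 2 * x_mom)
print(abs(x_sgd) < 1e-3, abs(x_mom) < 1e-3)  # both rules converge to the minimum
```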
[2] I never thought this conclusion needed shoring up in the first place, and in the cases where it’s not accepted, it’s not clear to me whether mathing it out like this is really going to help.
[3] In cases where the relative fitness of a trait corresponds with its prevalence, there can be a dynamic equilibrium at neither of these modes. Consider evolutionarily stable strategies. But the vast majority of mutations ever have hit the ‘extinct’ attractor, and a lot of extant material is of the form ‘ancestor of a large proportion of living organisms’.
[4] Though note we do see (briefly?) sustained upward fitness in times of abundance, as notably in human population and in adaptive radiation in response to new resources, habitats, and niches becoming available.
[5] Now, if the earlier instances of now-extinct lineages were somehow evolutionarily ‘frozen’ and periodically revived back into existence, we really would see that natural selection pushes for increased fitness. But because those lineages aren’t (by definition) around any more, the fitness landscape’s changes over time are under no obligation to be transitive, so in fact a faceoff between a chicken and a velociraptor might tell a different story.
[6] I think exploration heuristics are found throughout nature, and some ‘intrinsic curiosity’ reward shaping gets further (e.g. human and animal play), but ‘deliberate exploration’ (planning to arrange complicated scenarios with high anticipated information value) really sets humans (and perhaps a few other animals) apart. Then with cultural accumulation and especially the scientific revolution, we’ve collectively got really good at this deliberate exploration, and exploded even faster.
[7] e.g. vanilla SGD, momentum, RMSProp, Adagrad, Adam, …