Why are analogies so often drawn between natural selection and gradient descent in a machine learning context? Both are optimizing a fitness function of some kind, but isn’t there an important difference in what they are optimizing over?
Natural selection is broadly optimizing over the architecture, the initial parameters of that architecture, and the learning dynamics (how the parameters are updated given data). In the case of brains, this process produced the brain’s architecture and learning rules like STDP, where the parameters being updated are the synaptic connection strengths between neurons.
Isn’t gradient descent instead what we pick as the learning dynamics, alongside our chosen architecture (e.g. a transformer) and initial parameters (e.g. Xavier initialization)? If so, doesn’t it make more sense to draw an analogy between gradient descent and the optimizer learnt by natural selection (STDP, etc.), rather than natural selection itself?
And though natural selection is a simple optimization process, the optimizer (learning dynamics) it produces could be very complex, so reasoning like ‘natural selection is simple, so maybe the simplicity of gradient descent is sufficient’ doesn’t seem very strong?
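To make the distinction I have in mind concrete, here’s a minimal sketch (assuming PyTorch; the toy model and data are purely illustrative) of which pieces we pick by hand versus what gradient descent actually updates:

```python
# Toy sketch: the designer hand-picks architecture, initialization, and optimizer;
# gradient descent only ever updates the weights within those choices.
import torch
import torch.nn as nn

# Hand-picked by the designer (the things natural selection had to search over):
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))  # architecture
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.xavier_uniform_(layer.weight)                          # initial parameters
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)               # learning dynamics

# The "within-lifetime" part: gradient descent updates the weights given data.
x, y = torch.randn(32, 4), torch.randn(32, 1)  # toy data
for _ in range(100):
    loss = ((model(x) - y) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # only the weights change; architecture, init, and optimizer stay fixed
```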
I think it’s basically correct to say that evolution is mostly designing a within-lifetime learning algorithm (searching over neural architecture, reward function, etc.), and I argue about it all the time, see here and here. But there’s another school of thought (e.g. Steven Pinker, or Cosmides & Tooby) where within-lifetime learning is not so important—see here. I think Eliezer & Nate are closer to the second school of thought, and some disagreements stem from that.
I do think there are “obvious” things that one can say about learning algorithms in general for which evolution provides a perfectly fine example. E.g. “if I run an RL algorithm with reward function R, the trained model will not necessarily have an explicit endorsed goal to maximize R (or even know what R is)”. If you think invoking evolution as an example has too much baggage, fine, there are other examples or arguments that would also work.
Yes, what you say makes sense. One caveat might be that object-level gradient descent isn’t always the thing we want an analogy for—we might expect future systems to do a lot of meta-learning, where evolution might be a better analogy than human learning. Or we might expect future systems to take actions that affect their own architecture in a way that looks like deliberate engineering, which doesn’t have a great analogy with either.
Yeah, I personally think the better biological analogue for gradient descent is the “run-and-tumble” motion of bacteria.
Take an E. coli. It has a bunch of flagella, pointing in all directions. When it rotates its flagella clockwise, each of them ends up pushing in a random direction, which results in the cell chaotically tumbling without going very far. When it rotates its flagella counterclockwise, they get tangled up with each other and all end up pointing the same direction, and the cell moves in a roughly straight line. The more the attractant concentration is rising (and the repellent concentration falling), the more the cell rotates its flagella counterclockwise.
And that’s it. That’s the entire strategy by which E. coli navigates to food.
Here’s a page with an animation of how this extremely basic behavior approximates gradient descent.
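To get a feel for why this works, here’s a toy simulation (my own rough sketch, not taken from that page; the attractant field, step size, and tumble probabilities are all made up). The cell never measures a gradient directly; it just tumbles less often while the concentration has been rising, which biases the random walk uphill:

```python
# Toy run-and-tumble: compare the current attractant concentration to the previous one,
# and tumble (re-randomize the heading) more often when things are getting worse.
import numpy as np

rng = np.random.default_rng(0)

def attractant(pos):
    """Made-up attractant field: concentration is highest at the origin."""
    return -np.linalg.norm(pos)

def random_heading():
    h = rng.standard_normal(2)
    return h / np.linalg.norm(h)

pos = np.array([10.0, 10.0])
heading = random_heading()
prev_conc = attractant(pos)

for _ in range(2000):
    pos = pos + 0.05 * heading                      # "run": swim straight
    conc = attractant(pos)
    tumble_prob = 0.1 if conc > prev_conc else 0.9  # tumble more when concentration falls
    if rng.random() < tumble_prob:
        heading = random_heading()                  # "tumble": pick a new random direction
    prev_conc = conc

print(np.linalg.norm(pos))  # typically ends up far closer to the peak than where it started
```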
All that said, evolution looks kinda like gradient descent if you squint. For mind design, evolution would be gradient descent over the hyperparameters (and cultural evolution would be gradient descent over the training data generation process, and learning would be gradient descent over sensory data, and all of these gradients would steer in different but not entirely orthogonal directions).
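To spell out that nested picture, here’s a stylized sketch (everything in it is made up for illustration): an outer loop crudely hill-climbs a hyperparameter by mutation and selection, standing in for evolution, while an inner loop does ordinary gradient descent on data under that hyperparameter, standing in for within-lifetime learning:

```python
# Toy two-level optimization: "evolution" mutates a hyperparameter (the learning rate),
# keeping mutations that improve the result achieved by inner-loop gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)

def lifetime_learning(lr, steps=200):
    """Inner loop: plain gradient descent on squared error, given a learning rate."""
    w = np.zeros(3)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return np.mean((X @ w - y) ** 2)  # lower error = higher "fitness"

# Outer loop: mutate the hyperparameter, keep it only if the lifetime result improves.
lr, err = 1e-3, lifetime_learning(1e-3)
for _ in range(30):
    candidate = lr * np.exp(0.3 * rng.standard_normal())
    cand_err = lifetime_learning(candidate)
    if cand_err < err:
        lr, err = candidate, cand_err

print(lr, err)  # the "evolved" learning rate and the error its "lifetime" achieves
```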