The first thing is you think “evolution is not an RL algorithm”. I guess you think it’s a learning algorithm, but not a reinforcement learning algorithm?
To a certain extent this is taxonomy/definitions, but there is an entire literature on genetic/evolutionary optimization algorithms, and it does seem odd to categorize them under RL; I’ve never heard/seen those authors categorize them that way, and RL is a somewhat distant branch of ML. In my mental taxonomy (which I do believe is more standard/canonical), the genetic/evolutionary search family of algorithms are combinatorial/discrete algorithms that evaluate many candidate solutions in parallel and use separate heuristics for updating params (typically not directly related to the fitness function), instead using fitness-function evaluation to set each candidate’s replication factor for the next generation. They approximate the full Bayesian posterior with large-scale but crude sampling.
This family has advantages over gradient methods for exploring highly compressed/combinatorial parameter spaces, but it scales poorly to higher dimensions because it doesn’t solve fine-grained credit assignment at all.
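(To make concrete what I mean by that family, here’s a minimal sketch of a toy evolutionary search; the fitness function and hyperparameters are made up, and real GAs add crossover, tournament selection, etc. The point is just that fitness only sets replication, and no gradient of fitness w.r.t. the params is ever computed.)

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(genome):
    # Toy stand-in: any black-box score works, since nothing below
    # ever differentiates through it.
    return -np.sum((genome - 0.7) ** 2)

POP, DIM, GENS, MUT = 200, 20, 100, 0.05
population = rng.random((POP, DIM))

for _ in range(GENS):
    scores = np.array([fitness(g) for g in population])
    # Fitness only sets the replication factor: fitter genomes get more
    # offspring; there is no per-parameter credit assignment.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    parents = population[rng.choice(POP, size=POP, p=probs)]
    # Random mutation provides the exploration.
    population = parents + MUT * rng.normal(size=parents.shape)

best = max(population, key=fitness)
```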
RL on the other hand is just a slight variation of UL/SL, and is usually (almost always?) a continuous/gradient method.
So anyway the useful analogy is not between genetic evolution and within-lifetime learning, it’s between (genetic evolution, brain within-lifetime learning) and (tech/memetic evolution, ML training).
Random @nostalgebraist blog post: ““Reinforcement learning” (RL) is not a technique. It’s a problem statement, i.e. a way of framing a task as an optimization problem, so you can hand it over to a mechanical optimizer. … Life itself can be described as an RL problem”

Russell & Norvig 3rd edition: “The task of reinforcement learning is to use observed rewards to learn an optimal (or nearly optimal) policy for the environment. …Reinforcement learning might be considered to encompass all of AI.” (emphasis in original)
If we take the Russell & Norvig definition, and plug in “reward = IGF” and “policy = a genome that (in the right environment) unfolds into a complete brain and body”, then evolution by natural selection is “using observed rewards to learn an optimal (or nearly optimal) policy for the environment”, thus qualifying as RL according to that definition. Right?
Obviously I agree that if we pick a paper on arxiv that has “RL” in its title, it is almost certainly not talking about evolutionary biology :-P
I’ll also reiterate that I think we both agree that, even if the center column of that chart picture above is “an RL algorithm” by definition, it’s mechanistically pretty far outside the distribution of RL algorithms that human AI programmers are using now and are likely to use in the future, whereas within-lifetime learning is a more central example in almost every way. I’m definitely not disagreeing about that.
RL on the other hand is just a slight variation of UL/SL, and is usually (almost always?) a continuous/gradient method.
One important difference between RL and SL / SSL is that in RL, after you take an action, there’s no ground truth about what action you counterfactually should have taken instead. In other words, SL / SSL gives you a full error gradient “for free” with each query, whereas RL doesn’t. (You can still get an error gradient in policy-optimization, but it’s more expensive—you need to query the policy a bunch of times to get one gradient, IIUC.) Thus, RL algorithms involve random exploration then exploitation, whereas things like ImageNet training and LLM pretraining do not involve any random exploration.
(Notice here that evolution-by-natural-selection does involve a version of explore-then-exploit: mutations are random, and are initially trialed in a small number of individual organisms, and only if they’re helpful are they deployed more and more widely, eventually across the whole species.)
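(A minimal sketch of that SL-vs-RL contrast, with a toy one-parameter Bernoulli policy and a made-up reward, purely to illustrate: the supervised case gets an exact per-example gradient for free, while the REINFORCE-style estimate has to average over randomly explored actions.)

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Supervised case: exact gradient "for free" from each labeled example ---
def sl_grad(w, x, y):
    # Squared-error loss 0.5*(w*x - y)^2 has a closed-form per-example gradient.
    return (w * x - y) * x

# --- RL case: no label for the action you "should" have taken ---
def rl_grad_estimate(w, reward_fn, n_samples=1000):
    # Policy: take action 1 with probability sigmoid(w).
    p = 1.0 / (1.0 + np.exp(-w))
    grads = []
    for _ in range(n_samples):
        a = rng.random() < p          # random exploration
        r = reward_fn(a)              # scalar reward; no counterfactual label
        # REINFORCE / score-function estimator: grad log pi(a) * reward
        grad_logp = (1 - p) if a else -p
        grads.append(grad_logp * r)
    return np.mean(grads)             # noisy; needs many samples per gradient

print(sl_grad(w=0.5, x=2.0, y=1.0))                                  # exact
print(rl_grad_estimate(w=0.0, reward_fn=lambda a: 1.0 if a else 0.0))  # estimate
```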
The task of reinforcement learning is to use observed rewards to learn an optimal (or nearly optimal) policy for the environment
If we define RL broadly as “learning/optimizing an algorithm/policy towards optimality according to some arbitrary reward/objective function”, then sure it becomes indistinguishable from general optimization.
However, to me it is specifically the notion of reward in RL that distinguishes it, as you say here:
One important difference between RL and SL / SSL is that in RL, after you take an action, there’s no ground truth about what action you counterfactually should have taken instead.
Which already makes it less general than general optimization. A reward function is inherently a proxy: designed by human engineers to approximate the more complex unknown utility, or evolved by natural selection as a practical approximate proxy for something like IGF.
Evolution by natural selection doesn’t have any proxy reward function, so it lacks that distinguishing feature of RL. The ‘optimization objective’ of biological evolution is simply something like an emergent telic arrow of physics: replicators tend to replicate, much as net entropy always increases, etc. When humans use evolutionary algorithms as evolutionary simulations, the fitness function is more or less approximating the fitness a genotype would have if the physics were simulated out in more detail.
So anyway I think of RL as always an inner optimizer, using a proxy reward function that approximates the outer optimizer’s objective function.
I don’t think that, in order for an algorithm to be RL, its reward function must by definition be a proxy for something else more complicated. For example, the RL reward function of AlphaZero is not an approximation of a more complex thing—the reward function is just “did you win the game or not?”, and winning the game is a complete and perfect description of what DeepMind programmers wanted the algorithm to do. And everyone agrees that AlphaZero is an RL algorithm, indeed a central example. Anyway, AlphaZero would be an RL algorithm regardless of the motivations of the DeepMind programmers, right?
True, in games the environment itself can directly provide a reward channel, such that the perfect ‘proxy’ simplifies to a trivial identity mapping on that channel. But that’s hardly an interesting case, right? A human ultimately designed the reward channel for that engineered environment, often as a proxy for some human concept.
The types of games/sims that are actually interesting for AGI, or even just general robots or self-driving cars, are open-ended, and designing the correct reward function (as a proxy for true utility) is much of the challenge.
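(Purely illustrative sketch of what I mean: something like the shaped reward below for a driving agent, where every field and weight is a made-up stand-in, and each term is a hand-designed proxy for some piece of “drive well”; getting those proxies and weights right is most of the work.)

```python
def driving_reward(state):
    """Illustrative proxy reward for a driving agent; the state fields,
    terms, and weights are all made-up stand-ins for 'true utility'."""
    r = 0.0
    r += 1.0 * state["progress_along_route"]      # proxy for "get there"
    r -= 100.0 if state["collision"] else 0.0     # proxy for "safety"
    r -= 0.1 * abs(state["jerk"])                 # proxy for "comfort"
    r -= 1.0 if state["lane_violation"] else 0.0  # proxy for "follow the rules"
    return r
```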