The task of reinforcement learning is to use observed rewards to learn an optimal (or nearly optimal) policy for the environment.
If we define RL broadly as “learning/optimizing an algorithm/policy towards optimality according to some arbitrary reward/objective function”, then, sure, it becomes indistinguishable from general optimization.
However, to me it is specifically the notion of reward in RL that distinguishes it, as you say here:
One important difference between RL and SL / SSL is that in RL, after you take an action, there’s no ground truth about what action you counterfactually should have taken instead.
Which already makes it less general than general optimization. A reward function is inherently a proxy: either designed by human engineers to approximate a more complex, unknown utility, or evolved by natural selection as a practical approximation of something like inclusive genetic fitness (IGF).
Evolution by natural selection doesn’t have any proxy reward function, so it lacks that distinguishing feature of RL. The ‘optimization objective’ of biological evolution is simply something like an emergent telic arrow of physics: replicators tend to replicate, much as net entropy always increases, etc. When humans use evolutionary algorithms as evolutionary simulations, the fitness function is more or less an approximation of how fit a genotype would be if the physics were simulated out in more detail.
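As a concrete contrast, here is a minimal evolutionary-algorithm sketch (the genome encoding and the fitness function are toy choices I made up for illustration). The point is just that in an evolutionary algorithm the fitness proxy has to be written down explicitly somewhere, whereas biological evolution has no such function anywhere in the system:

```python
import random

def toy_fitness(genome):
    # Hand-written proxy: score the genome directly instead of simulating
    # the physics/ecology in which the corresponding organism would replicate.
    return sum(genome)

def evolve(pop_size=50, genome_len=10, generations=100, mutation_scale=0.1):
    population = [[random.random() for _ in range(genome_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection pressure comes entirely from the explicit proxy above.
        population.sort(key=toy_fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = [[g + random.gauss(0, mutation_scale) for g in p]
                    for p in parents]
        population = parents + children
    return max(population, key=toy_fitness)
```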
So anyway I think of RL as always an inner optimizer, using a proxy reward function that approximates the outer optimizer’s objective function.
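A minimal sketch of that inner/outer split, using a toy 1-D gridworld with numbers I made up for illustration: the Q-learner below only ever optimizes `proxy_reward`, while the thing the designer actually cares about (`outer_objective`) is never shown to the learner and is only measured from outside:

```python
import random

GOAL = 10              # illustrative 1-D gridworld with states 0..20
ACTIONS = [-1, +1]

def proxy_reward(state, action):
    # Hand-designed proxy the inner RL learner actually optimizes:
    # +1 for stepping toward the goal, minus a small step cost.
    moved_closer = abs(state + action - GOAL) < abs(state - GOAL)
    return (1.0 if moved_closer else 0.0) - 0.05

def outer_objective(final_state):
    # What the designer "really" wants: end the episode at the goal.
    # The learner never queries this; it only ever sees proxy_reward.
    return -abs(final_state - GOAL)

def q_learning(episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    q = {}  # (state, action) -> estimated value of the *proxy* return
    final_state = 0
    for _ in range(episodes):
        state = 0
        for _ in range(30):
            if random.random() < eps:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
            reward = proxy_reward(state, action)       # inner objective
            next_state = max(0, min(20, state + action))
            best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
            q[(state, action)] = q.get((state, action), 0.0) + alpha * (
                reward + gamma * best_next - q.get((state, action), 0.0))
            state = next_state
        final_state = state
    # The outer score is evaluated from outside the learner.
    return q, outer_objective(final_state)
```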
I don’t think that, in order for an algorithm to be RL, its reward function must by definition be a proxy for something else more complicated. For example, the RL reward function of AlphaZero is not an approximation of a more complex thing—the reward function is just “did you win the game or not?”, and winning the game is a complete and perfect description of what DeepMind programmers wanted the algorithm to do. And everyone agrees that AlphaZero is an RL algorithm, indeed a central example. Anyway, AlphaZero would be an RL algorithm regardless of the motivations of the DeepMind programmers, right?
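For concreteness, that reward function really is just a few lines. This is a sketch of the usual self-play convention, not DeepMind’s actual code: zero reward at every non-terminal move, then +1 / 0 / −1 for win / draw / loss.

```python
def game_outcome_reward(is_terminal: bool, winner, player) -> float:
    # Sketch of the standard self-play convention (not AlphaZero's real code):
    # no reward until the game ends, then +1 for a win, 0 for a draw,
    # -1 for a loss, from the perspective of `player`.
    if not is_terminal:
        return 0.0
    if winner is None:        # draw
        return 0.0
    return 1.0 if winner == player else -1.0
```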
True, in games the environment itself can directly provide a reward channel, such that the perfect ‘proxy’ simplifies to the trivial identity mapping on that channel. But that’s hardly an interesting case, right? A human ultimately designed the reward channel for that engineered environment, often as a proxy for some human concept.
The types of games/sims that are actually interesting for AGI, or even just general robots or self-driving cars, are open-ended ones where designing the correct reward function (as a proxy for true utility) is much of the challenge.
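To illustrate why that is the hard part, here is a made-up example of a driving proxy reward (none of these terms or weights come from any real system): “drive well” gets flattened into a hand-weighted bundle of measurable proxies, and every term and weight is a judgment call the agent can end up gaming.

```python
def driving_proxy_reward(progress_m: float, jerk: float,
                         lane_offset_m: float, min_gap_m: float,
                         collided: bool) -> float:
    # Hypothetical hand-designed proxy for the true utility "drive well".
    # Every term and weight below is an illustrative guess, which is
    # exactly where the difficulty lives.
    if collided:
        return -100.0
    reward = 0.1 * progress_m              # proxy for "make progress"
    reward -= 0.5 * abs(jerk)              # proxy for "ride comfort"
    reward -= 1.0 * abs(lane_offset_m)     # proxy for "stay in lane"
    if min_gap_m < 2.0:                    # proxy for "keep a safe distance"
        reward -= 5.0
    return reward
```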