Obviously true, but also weird that anyone would think that’s relevant? Even for a population which is perfectly aligned in expectation, you wouldn’t expect any individual to be perfectly aligned.
Some people think that misalignment in RL is fundamentally impossible. I’m pretty sure you’re not in that category. But for such people, it’s nice to have an example of “RL algorithm with reward function R → at least one trained model that scores catastrophically poorly on R as soon as it goes out of distribution”.
The CoinRun thing could also work in this context, but is subject to the (misguided) rebuttal “oh but that’s just because the model’s not smart enough to understand R”. So human examples can also be useful in the course of such conversations.
For any reasonable choice of a genetic utility function …
Also, what if all the DNA information still exists, but is stored more compactly on digital computers, which then replicate enormously in simulation, and the simulations themselves expand as our compute spreads a bit into space and advances further?
Here are a bunch of claims that are equally correct:
The history of evolution to date is exactly what you’d get from an RL algorithm optimizing genes for how many future copies are encoded in literal DNA molecules in basement reality.
The history of evolution to date is exactly what you’d get from an RL algorithm optimizing genes for how many future copies are encoded in literal DNA molecules in either basement reality or accurate simulations.
The history of evolution to date is exactly what you’d get from an RL algorithm optimizing genes for how many future copies are encoded in DNA molecules, or any other format that resembles it functionally (regardless of whether it resembles it chemically or mechanistically).
[infinitely many more things like that]
(Related to “goal misgeneralization”). If humans switch from DNA to XNA or upload themselves into simulations or whatever, then the future would be high-reward according to some of those RL algorithms and the future would be zero-reward according to others of those RL algorithms. In other words, one “experiment” is simultaneously providing evidence about what the results look like for infinitely many different RL algorithms. Lucky us.
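To make that concrete, here is a minimal toy sketch (the `WorldState` fields and the numbers are hypothetical, purely for illustration): two reward functions that assign identical reward to every state in the observed history, yet diverge wildly on a hypothetical out-of-distribution future where the same information is stored digitally.

```python
# Toy sketch: two reward functions that agree on every historical state but
# diverge on an out-of-distribution future. All fields/numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class WorldState:
    dna_copies: int       # gene copies encoded in literal DNA molecules
    digital_copies: int   # gene copies encoded in simulations / other substrates

def reward_literal_dna(s: WorldState) -> float:
    # "copies encoded in literal DNA molecules in basement reality"
    return float(s.dna_copies)

def reward_functional_copies(s: WorldState) -> float:
    # "copies in DNA or any functionally equivalent format"
    return float(s.dna_copies + s.digital_copies)

history = [WorldState(dna_copies=n, digital_copies=0) for n in (10, 100, 1000)]
uploaded_future = WorldState(dna_copies=0, digital_copies=10**9)

# The two reward functions are indistinguishable on the history to date...
assert all(reward_literal_dna(s) == reward_functional_copies(s) for s in history)

# ...but give opposite verdicts on the out-of-distribution future.
print(reward_literal_dna(uploaded_future))        # 0.0
print(reward_functional_copies(uploaded_future))  # 1000000000.0
```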
You’re welcome to argue that the evidence from some of these “experiments” is more relevant to AI alignment than the evidence from others of these experiments. (I figure that’s what you mean when you say “for any reasonable choice of a genetic utility function…”?) But to make that argument, I think you’d have to start talking more specifically about what you expect AI alignment to look like, and thus why some of those bullet-point RL algorithms are more disanalogous to AI alignment than others. You seem not to be doing that (unless I missed it); instead you seem to be just talking intuitively about what’s “reasonable” (not what’s a tight analogy).
(Again, my own position is that all of those bullet points are sufficiently disanalogous to AI alignment that it’s not really a useful example, with the possible exception of arguing with someone who takes the extreme position that misalignment is impossible in RL as discussed at the top of this comment.)
Some people think that misalignment in RL is fundamentally impossible. … for such people, it’s nice to have an example of “RL algorithm with reward function R → at least one trained model that scores catastrophically poorly on R as soon as it goes out of distribution”.
I don’t actually believe that, but I also suspect that for those who do, the example isn’t relevant, because evolution is not an RL algorithm.
Evolution is an outer optimizer that uses RL as the inner optimizer. Evolution proceeds by performing many experiments in parallel, all of which are imperfect/flawed, and the flaws are essential to the algorithm’s progress. The individual inner RL optimizers can’t be perfectly aligned, because if they were, the outer optimizer wouldn’t have actually worked in the first place; and the outer optimizer can be perfectly aligned even though every single instance of the inner optimizer is at least somewhat misaligned. This is perhaps less commonly understood than I initially assumed—but true nonetheless.
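A rough toy sketch of that outer/inner structure, under heavy simplifying assumptions (the bandit task, the “proxy bias” genome, and all numbers are made up for illustration, not a claim about how evolution or brains actually work): the outer loop selects genomes by realized fitness, while each genome’s inner RL learner trains on its own imperfect proxy reward.

```python
# Toy sketch of an outer evolutionary loop over inner RL learners.
# All details (bandit task, proxy reward, parameters) are made up for illustration.
import random

def inner_rl_lifetime(proxy_bias: float, steps: int = 200) -> float:
    """A crude inner RL learner: 2-armed bandit trained on a *proxy* reward.

    True fitness comes only from arm 1, but the learner's reward is biased by
    `proxy_bias` toward arm 0, so each individual is somewhat 'misaligned'.
    """
    q = [0.0, 0.0]          # action-value estimates
    fitness = 0.0
    for _ in range(steps):
        a = random.randrange(2) if random.random() < 0.1 else q.index(max(q))
        true_payoff = 1.0 if a == 1 else 0.0
        proxy_reward = true_payoff + (proxy_bias if a == 0 else 0.0)
        q[a] += 0.1 * (proxy_reward - q[a])   # simple TD-style update on the proxy
        fitness += true_payoff                # outer objective ignores the proxy
    return fitness

def outer_evolution(generations: int = 30, pop_size: int = 20) -> float:
    population = [random.uniform(0.0, 2.0) for _ in range(pop_size)]  # genome = proxy bias
    for _ in range(generations):
        scored = sorted(population, key=inner_rl_lifetime, reverse=True)
        parents = scored[: pop_size // 2]
        # Selection + mutation: the outer loop only ever sees realized fitness.
        population = [p + random.gauss(0.0, 0.1) for p in parents for _ in range(2)]
    return sum(population) / len(population)

# Average proxy bias; selection tends to push it below its initial mean,
# even though no individual inner learner is ever perfectly aligned.
print(outer_evolution())
```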
Technology evolves according to an evolutionary process that is closely similar/related to genetic evolution (but with greatly improved update operators). The analogy is between genetic evolution over RL brains <-> tech evolution over RL AGIs.
If humans switch from DNA to XNA or upload themselves into simulations or whatever, then the future would be high-reward according to some of those RL algorithms and the future would be zero-reward according to others of those RL algorithms. … You’re welcome to argue that the evidence from some of these “experiments” is more relevant to AI alignment than the evidence from others of these experiments.
Again, none of those future scenarios have played out; they aren’t evidence yet, just speculation.
I made this diagram recently—I just colored some things red to flag areas where I think we’re disagreeing.
The first thing is that you think “evolution is not an RL algorithm”. I guess you think it’s a learning algorithm but not a reinforcement learning algorithm? Or do you not even think it’s a learning algorithm? I’m pretty confused by your perspective. Everyone calls PPO an RL algorithm, and I think of evolution as “kinda like PPO but with a probably-much-less-efficient learning rule, because it’s not properly calculating gradients”. On the other hand, I’m not sure exactly what the definition of RL is—I don’t think it’s perfectly standardized.
(We both agree that, as a learning algorithm, evolution is very weird compared to the learning algorithms that people have been programming in the past and will continue programming in the future. For example, ML runs normally involve updating the trained model every second or sub-second or whatever, not “one little trained model update step every 20 years”. In other words, I think we both agree that the right column is a much much closer analogy for future AGI training than the middle column.)
(But as weird a learning algorithm as the middle column is, it’s still a learning algorithm. So if we want to make and discuss extremely general statements that apply to every possible learning algorithm, then it’s fair game to talk about the middle column.)
The second thing is on the bottom row: I get the impression that
you want to emphasize that the misalignment could be much higher (e.g. if nobody cared about biological children & kin),
I want to emphasize that the misalignment could be much less (e.g. if men were paying their life savings for the privilege of being a sperm donor).
But that’s not really a disagreement, because both of those things are true. :)
The first thing is that you think “evolution is not an RL algorithm”. I guess you think it’s a learning algorithm but not a reinforcement learning algorithm?
To a certain extent this is taxonomy/definitions, but there is an entire literature on genetic/evolutionary optimization algorithms, and it does seem odd to categorize them under RL—I’ve never heard/seen those authors categorize it that way, and RL is a somewhat distant branch of ML. In my mental taxonomy (which I do believe is more standard/canonical), the genetic/evolutionary search family of algorithms are combinatoric/discrete algorithms that evaluate many candidate solutions in parallel and use separate heuristics for updating params (typically not directly related to the fitness function), instead using fitness-function evaluation to select the replication factor for the next generation. They approximate the full Bayesian posterior with large-scale but crude sampling.
This family has advantages over gradient methods for exploring highly compressed/combinatoric parameter spaces, but it scales poorly to higher dimensions, as it doesn’t solve fine-grained credit assignment at all.
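For reference, that family looks roughly like the following generic sketch (not any specific published algorithm; the fitness function and parameters are arbitrary): candidates are evaluated in parallel, fitness only sets replication counts, and the mutation heuristic never sees a gradient.

```python
# Generic sketch of the genetic/evolutionary-search family described above:
# many candidates evaluated in parallel, fitness used only to set replication
# counts, and parameter updates done by a mutation heuristic that never sees
# a gradient of the fitness function.
import random

def evolutionary_search(fitness, dim=8, pop_size=50, generations=100):
    population = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(x) for x in population]            # parallel evaluation
        total = sum(scores) or 1e-9
        # Fitness-proportional replication: better candidates get more offspring.
        offspring_counts = [max(1, round(pop_size * s / total)) for s in scores]
        new_pop = []
        for x, n in zip(population, offspring_counts):
            for _ in range(n):
                # Mutation heuristic: unrelated to any gradient of `fitness`.
                new_pop.append([xi + random.gauss(0, 0.05) for xi in x])
        population = random.sample(new_pop, min(pop_size, len(new_pop)))
    return max(population, key=fitness)

# Example: maximize a simple non-negative fitness peaked at the origin.
best = evolutionary_search(lambda x: max(0.0, 1.0 - sum(xi * xi for xi in x)))
print(best)
```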
RL, on the other hand, is just a slight variation of UL/SL, and is usually (almost always?) a continuous/gradient method.
So anyway, the useful analogy is not between genetic evolution and within-lifetime learning; it’s between (genetic evolution, brain within-lifetime learning) and (tech/memetic evolution, ML training).
Random @nostalgebraist blog post: ““Reinforcement learning” (RL) is not a technique. It’s a problem statement, i.e. a way of framing a task as an optimization problem, so you can hand it over to a mechanical optimizer. … Life itself can be described as an RL problem”
Russell & Norvig 3rd edition: “The task of reinforcement learning is to use observed rewards to learn an optimal (or nearly optimal) policy for the environment. … Reinforcement learning might be considered to encompass all of AI.” (emphasis in original)
If we take the Russell & Norvig definition, and plug in “reward = IGF” and “policy = a genome that (in the right environment) unfolds into a complete brain and body”, then evolution by natural selection is “using observed rewards to learn an optimal (or nearly optimal) policy for the environment”, thus qualifying as RL according to that definition. Right?
Obviously I agree that if we pick a paper on arxiv that has “RL” in its title, it is almost certainly not talking about evolutionary biology :-P
I’ll also reiterate that I think we both agree that, even if the center column of the diagram above is “an RL algorithm” by definition, it’s mechanistically pretty far outside the distribution of RL algorithms that human AI programmers are using now and are likely to use in the future, whereas within-lifetime learning is a more central example in almost every way. I’m definitely not disagreeing about that.
RL, on the other hand, is just a slight variation of UL/SL, and is usually (almost always?) a continuous/gradient method.
One important difference between RL and SL / SSL is that in RL, after you take an action, there’s no ground truth about what action you counterfactually should have taken instead. In other words, SL / SSL gives you a full error gradient “for free” with each query, whereas RL doesn’t. (You can still get an error gradient in policy-optimization, but it’s more expensive—you need to query the policy a bunch of times to get one gradient, IIUC.) Thus, RL algorithms involve random exploration then exploitation, whereas things like ImageNet training and LLM pretraining do not involve any random exploration.
(Notice here that evolution-by-natural-selection does involve a version of explore-then-exploit: mutations are random, and are initially trialed in a small number of individual organisms, and only if they’re helpful are they deployed more and more widely, eventually across the whole species.)
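A toy illustration of that difference, under made-up assumptions (one parameter, one step, two actions): the supervised gradient falls out of a single labeled example, while a REINFORCE-style estimate has to be assembled from many sampled actions and their rewards.

```python
# Toy contrast: supervised gradient vs. REINFORCE-style policy-gradient estimate
# on a one-parameter, one-step, two-action problem. Setup is illustrative only.
import math
import random

def p_right(theta: float) -> float:
    """Policy: probability of taking the 'right' action."""
    return 1.0 / (1.0 + math.exp(-theta))

theta = 0.0

# Supervised learning: the label says which action was correct, so a single
# example gives the exact log-likelihood gradient.
label_is_right = True
grad_supervised = (1.0 if label_is_right else 0.0) - p_right(theta)

# RL: we only observe the reward of the action we actually took, so we estimate
# the gradient of expected reward by sampling many actions (explore) and
# weighting the score function d/dtheta log pi(a) by the reward received.
def reward(action_right: bool) -> float:
    return 1.0 if action_right else 0.0   # 'right' happens to be the good action

samples = []
for _ in range(10_000):
    took_right = random.random() < p_right(theta)
    score = (1.0 - p_right(theta)) if took_right else (0.0 - p_right(theta))
    samples.append(reward(took_right) * score)
grad_rl_estimate = sum(samples) / len(samples)

# The supervised gradient came from one labeled example; the RL estimate is a
# noisy average over many sampled actions (and it targets a different objective).
print(grad_supervised, grad_rl_estimate)
```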
The task of reinforcement learning is to use observed rewards to learn an optimal (or nearly optimal) policy for the environment
If we define RL broadly as “learning/optimizing an algorithm/policy towards optimality according to some arbitrary reward/objective function”, then sure it becomes indistinguishable from general optimization.
However, to me it is specifically the notion of reward in RL that distinguishes it, as you say here:
One important difference between RL and SL / SSL is that in RL, after you take an action, there’s no ground truth about what action you counterfactually should have taken instead.
Which already makes it less general than general optimization. A reward function is inherently a proxy—designed by human engineers to approximate the more complex unknown utility, or evolved by natural selection as a practical approximate proxy for something like IGF.
Evolution by natural selection doesn’t have any proxy reward function, so it lacks that distinguishing feature of RL. The ‘optimization objective’ of biological evolution is simply something like an emergent telic arrow of physics: replicators tend to replicate, similar to how net entropy always increases, etc. When humans use evolutionary algorithms as evolutionary simulations, the fitness function is more or less approximating the fitness the genotype would have if the physics were simulated out in more detail.
So anyway I think of RL as always an inner optimizer, using a proxy reward function that approximates the outer optimizer’s objective function.
I don’t think that, in order for an algorithm to be RL, its reward function must by definition be a proxy for something else more complicated. For example, the RL reward function of AlphaZero is not an approximation of a more complex thing—the reward function is just “did you win the game or not?”, and winning the game is a complete and perfect description of what DeepMind programmers wanted the algorithm to do. And everyone agrees that AlphaZero is an RL algorithm, indeed a central example. Anyway, AlphaZero would be an RL algorithm regardless of the motivations of the DeepMind programmers, right?
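For concreteness, a reward of that kind is just the terminal game outcome; a minimal sketch (the names below are illustrative, not DeepMind’s actual API):

```python
# Sketch of a game-outcome reward in the AlphaZero style: the reward *is* the
# objective ("did you win?"), not a proxy for something more complex.
# Names here are illustrative, not DeepMind's actual API.
from enum import Enum

class Outcome(Enum):
    WIN = 1
    DRAW = 0
    LOSS = -1

def terminal_reward(outcome: Outcome) -> float:
    # Nonzero only at the end of the game; all intermediate moves get 0.
    return float(outcome.value)

print(terminal_reward(Outcome.WIN), terminal_reward(Outcome.LOSS))  # 1.0 -1.0
```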
True, in games the environment itself can directly provide a reward channel, such that the perfect ‘proxy’ simplifies to the trivial identity connection on that channel. But that’s hardly an interesting case, right? A human ultimately designed the reward channel for that engineered environment, often as a proxy for some human concept.
The types of games/sims that are actually interesting for AGI, or even just for general robots or self-driving cars, are open-ended ones where designing the correct reward function (as a proxy for true utility) is much of the challenge.
Again, none of those future scenarios have played out; they aren’t evidence yet, just speculation
That’s a very weird notion of what constitutes evidence. If you built an AGI and your interpretability tools showed you that the AGI is plotting to kill you, it would be pretty hard evidence in favor of a sharp left turn, even if you were still alive.
Not at all—just look at the definition of Solomonoff induction: the distribution over world models/theories is updated strictly on new historical observation bits, and never on future predicted observations. If you observe mental states inside the AGI, that is naturally valid observational evidence from your pov. But that is very different from you predicting that the AGI is going to kill you, and then updating your world model based on those internal predictions—those feedback loops rapidly diverge from reality (and may be related to schizophrenia).
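For reference, the update being pointed at can be written roughly as a Solomonoff-style posterior over hypotheses $h$ (with $\ell(h)$ the program length and $x_{1:t}$ the bits actually observed so far; semimeasure details glossed over):

$$P(h \mid x_{1:t}) \;\propto\; 2^{-\ell(h)}\, P(x_{1:t} \mid h), \qquad P(x_{t+1} \mid x_{1:t}) \;=\; \sum_h P(x_{t+1} \mid h, x_{1:t})\, P(h \mid x_{1:t})$$

The posterior conditions only on bits that have already been observed; the prediction on the right is computed from that posterior but never fed back into it.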
I’m talking about observable evidence, like transhumanists claiming they will drop their biological bodies at the first opportunity.