“Inner Alignment Failures” Which Are Actually Outer Alignment Failures
If you don’t know what “inner” and “outer” optimization are, or why birth control or masturbation might be examples, then check out one of the posts here before reading this one. Thanks to Evan, Scott, and Richard for discussions around these ideas—though I doubt all their objections are settled yet.
Claim: from an evolutionary fitness perspective, masturbation is an inner alignment failure, but birth control is an outer alignment failure.
More generally:
Assuming the outer optimizer successfully optimizes, failure to generalize from the training environment to the deployment environment is always an outer alignment failure (regardless of whether an inner optimizer appears).
Assuming outer alignment, inner alignment problems can occur only as the result of imperfect optimization.
All of these claims stem from one main argument, which can be applied to any supposed inner alignment problem.
This post will outline the argument, give a few examples, and talk about some implications.
Motivating Example: Birth Control vs Masturbation
Consider some simple biological system—like a virus or a mycobacterium—which doesn’t perform any significant optimization. The system has been optimized by evolution to perform well in certain environments, but it doesn’t perform any optimization “at runtime”.
From an evolutionary fitness perspective, the system can still end up misaligned. The main way this happens is an environment shift: the system evolved to reproduce in a certain environment, and doesn’t generalize well to other environments. A virus which spreads quickly in the rodents of one particular Pacific island may not spread well on the mainland; a mycobacterium evolved for the guts of ancestral-environment humans may not flourish in the guts of humans with a modern diet.
Obviously this scenario does not involve any inner optimization, because there is no inner optimizer. It’s purely an outer alignment failure: the objective “reproductive fitness in ancestral environment” is not aligned with the objective “reproductive fitness in current environment”.
Now imagine a more complex biological system which is performing some optimization—e.g. humans. Evolutionary “training” optimized humans for reproductive fitness in the ancestral environment. Yet the modern environment contains many possibilities which weren’t around in the ancestral environment—e.g. birth control pills. “Wanting birth control” did not decrease reproductive fitness in the ancestral environment (because it wasn’t available anyway), but it does in the modern environment. Thus, birth control is an outer alignment problem: “reproductive fitness in the ancestral environment” is not aligned with “reproductive fitness in the modern environment”.
By contrast, masturbation is a true inner alignment failure. Masturbation was readily available in the ancestral environment, and arose from a misalignment between human objectives and evolutionary objectives (i.e. fitness in the ancestral environment). It presumably didn’t decrease fitness very much in the ancestral environment, not enough for evolution to quickly find a workaround, but it sure seems unlikely to have increased fitness—there is some “waste” involved. Key point: had evolution somehow converged to the “true optimum” of fitness in the ancestral environment (or even just something a lot more optimal), then that more-reproductively-fit “human” probably wouldn’t masturbate, even if it were still an inner optimizer.
General Argument
“Outer alignment” means that the main optimization objective is a good proxy for the thing we actually want. In the context of systems which are trained offline, this means that “performance in the training environment” is a good proxy for “performance in the deployment environment”—systems which perform well in the training environment should perform well in the deployment environment. All systems. If there is any system which performs well in the training environment but not in the deployment environment, then that’s an outer alignment failure. The training environment and objective are not aligned with the deployment environment and objective.
This is true regardless of whether there happens to be an inner optimizer or not.
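As a cartoonish way to operationalize that criterion: given some way to score a system in each environment, an outer alignment failure in this sense is just a system which scores well in training and poorly in deployment. In the sketch below, `evaluate`, `candidates`, and the `good_enough` threshold are hypothetical placeholders, not part of any real training setup.

```python
def outer_alignment_failures(candidates, evaluate, train_env, deploy_env,
                             good_enough=0.9):
    """Flag systems which look good in training but not in deployment.

    `evaluate(system, env)` is assumed to return a score in [0, 1];
    both it and `candidates` are hypothetical stand-ins.
    """
    failures = []
    for system in candidates:
        train_score = evaluate(system, train_env)
        deploy_score = evaluate(system, deploy_env)
        # Performs well on the proxy (training) but not on the real thing
        # (deployment): the proxy was bad, whether or not `system`
        # contains an inner optimizer.
        if train_score >= good_enough and deploy_score < good_enough:
            failures.append(system)
    return failures
```

If this function can return anything at all, then training performance was not a good proxy for deployment performance, which is exactly the outer alignment failure described above.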
Corollary: assuming outer alignment holds, all inner alignment failures are the result of imperfect outer optimization. Reasoning: assuming outer alignment, good performance on the proxy implies good performance on what we actually want. Conversely, if the inner optimizer gives suboptimal performance on what we want, then it gives suboptimal performance on the proxy. And if it gives suboptimal performance on the proxy, then the outer optimizer should not have selected that design anyway—unless the outer optimizer has failed to fully optimize.
Here’s a mathematical version of this argument, in a reinforcement learning context. We’ll define the optimal policy π∗ as
π∗ = argmax_π E[ ∑_t r_t | π ]
… where r_t are the reward signals and the expectation is taken over the training distribution. Evan (one of the people who introduced the inner/outer optimization split) defines outer alignment in this context as “π∗ is aligned with our true goals”.
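To make that definition concrete, here is a minimal brute-force sketch of the argmax on a made-up two-step environment. The toy reward scheme and the tiny finite policy space are purely illustrative assumptions; the point is just that π∗ is defined entirely by expected return under the training distribution, with no reference to deployment.

```python
import itertools
import random

ACTIONS = ("a", "b")   # made-up action set
HORIZON = 2            # made-up episode length

def rollout_return(policy, episode_seed):
    """Return of one sampled training episode under a fixed action sequence."""
    rng = random.Random(episode_seed)
    total = 0.0
    for t in range(HORIZON):
        # Toy reward: action "a" reliably pays 1, action "b" pays a noisy amount.
        total += 1.0 if policy[t] == "a" else rng.uniform(-1.0, 2.0)
    return total

def expected_return(policy, n_samples=1000):
    """Monte Carlo estimate of E[ sum_t r_t | policy ] over the training distribution."""
    return sum(rollout_return(policy, seed) for seed in range(n_samples)) / n_samples

# The "perfect outer optimizer": enumerate every deterministic policy and take the argmax.
policies = list(itertools.product(ACTIONS, repeat=HORIZON))
pi_star = max(policies, key=expected_return)
```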
Let’s assume that our reinforcement learner spits out a policy π containing a malicious inner optimizer, resulting in behavior unaligned with our true goals. There are two possibilities. Either:
The unaligned policy π is outer-optimal (i.e. optimal for the outer optimization problem), π=π∗, so the outer-optimal policy π∗ is unaligned with our goals. This is an outer alignment failure.
The policy π is not outer-optimal, π≠π∗, so the outer optimizer failed to fully optimize.
Thus: assuming outer alignment holds, all inner alignment failures are the result of imperfect optimization. QED.
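The case split can also be written as a purely illustrative diagnostic. Here `expected_return` and `pi_star` are the hypothetical helpers from the sketch above, and `is_aligned` stands in for an oracle answering “does this policy behave well by our true goals?”, which of course we do not actually have:

```python
def diagnose(learned_policy, pi_star, expected_return, is_aligned, tol=1e-6):
    """Classify a learned policy per the two-case argument above."""
    if is_aligned(learned_policy):
        return "aligned"
    if expected_return(learned_policy) >= expected_return(pi_star) - tol:
        # Unaligned yet (near-)outer-optimal: the outer objective itself
        # endorses bad behavior, i.e. an outer alignment failure.
        return "outer alignment failure"
    # Unaligned and suboptimal on the outer objective: the outer optimizer
    # failed to fully optimize, which is the only room left for a "true"
    # inner alignment failure.
    return "imperfect optimization"
```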
Given Perfect Outer Optimization, Deceptive Alignment Is Always An Outer Alignment Failure
In particular, we can apply this argument in the context of deceptive alignment: an inner optimizer which realizes it’s in training, and “plays along” until it’s deployed, at which point it acts maliciously.
We have two possibilities. The first possibility is that the deceptive inner optimizer was not actually optimal even in training. If the outer optimizer is imperfect, the inner optimizer might be able to “trick it” into choosing a suboptimal policy during training—more on that shortly.
But assuming that the deceptive inner optimizer is part of the outer-optimal policy, the outer objective must be unaligned. After all, the policy containing the deceptive inner optimizer is itself an example of a policy which is optimal under the objective but unaligned with human values. Clearly, the outer objective assigns maximal reward to a system which does not produce good behavior by human standards. That’s an outer alignment failure.
When Inner Alignment Does Matter: Imperfect Optimization
Inner alignment is still an important issue, because in practice we do not use perfect brute-force optimizers. We do not enumerate every single possibility in the search space and then choose the highest-scoring. Instead, we use imperfect optimizers, like gradient descent.
In principle, an inner optimizer could “hack” an imperfect optimization algorithm. I talk more about what this looks like in Demons in Imperfect Search (also check out Tessellating Hills, a concrete example in the context of gradient descent). Biological evolution provides an excellent example: greedy genes. In sexually reproducing organisms, which of a parent’s two gene copies gets passed on to each offspring is normally a fair coin flip. But certain gene variants can bias that coin flip in their own favor, allowing those genes to persist and spread even if they somewhat decrease the organism’s reproductive fitness.
This is a prototypical example of what a true inner alignment problem would look like: the inner optimizer would “hack” the search function of the outer optimizer.
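Here is a crude toy model of that dynamic, in the spirit of the greedy-genes example. All the numbers and the simplified inheritance rule are invented for illustration; the point is that an allele which biases its own transmission can spread even while paying a cost on the fitness measure that the “outer” selection process actually uses.

```python
import random

POP_SIZE = 500
GENERATIONS = 300
TRANSMISSION_BIAS = 0.7   # chance a carrier parent passes the allele on (a fair draw would be 0.5)
FITNESS_COST = 0.1        # carriers are chosen as parents 10% less often

def simulate(seed=0):
    rng = random.Random(seed)
    pop = [False] * POP_SIZE                            # True = carries the "greedy" allele
    pop[: POP_SIZE // 20] = [True] * (POP_SIZE // 20)   # start at 5% frequency
    for _ in range(GENERATIONS):
        # "Outer" selection: carriers are slightly less fit as organisms.
        weights = [1.0 - FITNESS_COST if carrier else 1.0 for carrier in pop]
        new_pop = []
        for _ in range(POP_SIZE):
            mom, dad = rng.choices(pop, weights=weights, k=2)
            # "Inner" hack: a carrier parent transmits the allele with
            # probability TRANSMISSION_BIAS instead of a fair 0.5.
            inherited = any(parent and rng.random() < TRANSMISSION_BIAS
                            for parent in (mom, dad))
            new_pop.append(inherited)
        pop = new_pop
    return sum(pop) / POP_SIZE

print(f"final carrier frequency: {simulate():.2f}")
```

With these made-up numbers the allele typically ends up in most of the population despite the fitness cost: the search process itself, not the objective, is what got exploited.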
One takeaway: inner alignment is a conceptually easy problem—we just need to fully optimize the objective. But it’s still potentially difficult in practice, because (assuming P ≠ NP) perfect optimization is Hard. From this perspective, the central problem of inner alignment is to develop optimization algorithms which do not contain exploitable vulnerabilities.
On the flip side, we probably also want outer objectives which do not contain exploitable vulnerabilities. That’s an outer alignment problem.
UPDATE: after discussion in the comments, I think the root of the disagreements I had with Evan and Richard is that they’re thinking of “inner alignment” in a way which does not necessarily involve any inner optimizer at all. They’re thinking of generalization error as “inner alignment failure” essentially by definition, regardless of whether there’s any inner optimizer involved. Conversely, they think of “outer alignment” in a way which ignores generalization errors.
As definitions go, these seem to cut reality at a reasonable set of joints, though the names seem misleading (the names certainly misled me!).
Adopting their definitions, the core argument of this post is something like:
(outer alignment) + (no generalization error) + (full optimization) ⇒ alignment
generalization error and outer alignment are both orthogonal to mesa-optimization; they are problems which need to be addressed regardless of whether mesa-optimizers are an issue
therefore mesa-optimizers are a problem in their own right only to the extent that they exploit imperfections in the (outer) optimization algorithm.