The Inner Alignment Problem
This is the third of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.
In this post, we outline reasons to think that a mesa-optimizer may not optimize the same objective function as its base optimizer. Machine learning practitioners have direct control over the base objective function—either by specifying the loss function directly or training a model for it—but cannot directly specify the mesa-objective developed by a mesa-optimizer. We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.
Current machine learning methods select learned algorithms by empirically evaluating their performance on a set of training data according to the base objective function. Thus, ML base optimizers select mesa-optimizers according to the output they produce rather than directly selecting for a particular mesa-objective. Moreover, the selected mesa-optimizer’s policy only has to perform well (as scored by the base objective) on the training data. If we adopt the assumption that the mesa-optimizer computes an optimal policy given its objective function, then we can summarize the relationship between the base and mesa-objectives as follows:(17)

$$\theta^* = \operatorname{argmax}_\theta \mathbb{E}\big(O_{\text{base}}(\pi_\theta)\big), \quad \text{where } \pi_\theta = \operatorname{argmax}_\pi \mathbb{E}\big(O_{\text{mesa}}(\pi \mid \theta)\big)$$

That is, the base optimizer maximizes its objective $O_{\text{base}}$ by choosing a mesa-optimizer with parameterization $\theta$ based on the mesa-optimizer’s policy $\pi_\theta$, but not based on the objective function $O_{\text{mesa}}$ that the mesa-optimizer uses to compute this policy. Depending on the base optimizer, we will think of $O_{\text{base}}$ as the negative of the loss, the future discounted reward, or simply some fitness function by which learned algorithms are being selected.
An interesting approach to analyzing this connection is presented in Ibarz et al., where empirical samples of the true reward and a learned reward on the same trajectories are used to create a scatter-plot visualization of the alignment between the two.(18) The assumption in that work is that a monotonic relationship between the learned reward and true reward indicates alignment, whereas deviations from that suggest misalignment. Building on this sort of research, better theoretical measures of alignment might someday allow us to speak concretely in terms of provable guarantees about the extent to which a mesa-optimizer is aligned with the base optimizer that created it.
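As a rough illustration of the monotonicity idea, the sketch below summarizes a set of (true reward, learned reward) samples with a rank correlation. This is an assumption made for illustration: the cited work uses scatter plots, and the Spearman coefficient, the function names, and the sample values here are all hypothetical stand-ins rather than anything taken from that work.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Rank correlation: 1.0 means a perfectly monotonic increasing relationship."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical reward samples on the same five trajectories.
true_reward = [0.0, 1.0, 2.0, 3.0, 4.0]
monotone_learned = [r ** 3 for r in true_reward]   # nonlinear but order-preserving
divergent_learned = [0.0, 1.0, 2.0, 3.0, -5.0]     # disagrees on the best trajectory

print(spearman(true_reward, monotone_learned))
print(spearman(true_reward, divergent_learned))
```

A single summary number hides where the two rewards disagree, which is one reason a scatter plot can be more informative than a coefficient.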
3.1. Pseudo-alignment
There is currently no complete theory of the factors that affect whether a mesa-optimizer will be pseudo-aligned—that is, whether it will appear aligned on the training data, while actually optimizing for something other than the base objective. Nevertheless, we outline a basic classification of ways in which a mesa-optimizer could be pseudo-aligned:
Proxy alignment,
Approximate alignment, and
Suboptimality alignment.
Proxy alignment. The basic idea of proxy alignment is that a mesa-optimizer can learn to optimize for some proxy of the base objective instead of the base objective itself. We’ll start by considering two special cases of proxy alignment: side-effect alignment and instrumental alignment.
First, a mesa-optimizer is side-effect aligned if optimizing for the mesa-objective $O_{\text{mesa}}$ has the direct causal result of increasing the base objective $O_{\text{base}}$ in the training distribution, and thus when the mesa-optimizer optimizes $O_{\text{mesa}}$ it results in an increase in $O_{\text{base}}$. For an example of side-effect alignment, suppose that we are training a cleaning robot. Consider a robot that optimizes the number of times it has swept a dusty floor. Sweeping a floor causes the floor to be cleaned, so this robot would be given a good score by the base optimizer. However, if during deployment it is offered a way to make the floor dusty again after cleaning it (e.g. by scattering the dust it swept up back onto the floor), the robot will take it, as it can then continue sweeping dusty floors.
Second, a mesa-optimizer is instrumentally aligned if optimizing for the base objective $O_{\text{base}}$ has the direct causal result of increasing the mesa-objective $O_{\text{mesa}}$ in the training distribution, and thus the mesa-optimizer optimizes $O_{\text{base}}$ as an instrumental goal for the purpose of increasing $O_{\text{mesa}}$. For an example of instrumental alignment, suppose again that we are training a cleaning robot. Consider a robot that optimizes the amount of dust in the vacuum cleaner. Suppose that in the training distribution the easiest way to get dust into the vacuum cleaner is to vacuum the dust on the floor. It would then do a good job of cleaning in the training distribution and would be given a good score by the base optimizer. However, if during deployment the robot came across a more effective way to acquire dust, such as by vacuuming the soil in a potted plant, then it would no longer exhibit the desired behavior.
We propose that it is possible to understand the general interaction between side-effect and instrumental alignment using causal graphs, which leads to our general notion of proxy alignment.
Suppose we model a task as a causal graph with nodes for all possible attributes of that task and arrows between nodes for all possible relationships between those attributes. Then we can also think of the mesa-objective $O_{\text{mesa}}$ and the base objective $O_{\text{base}}$ as nodes in this graph. For $O_{\text{mesa}}$ to be pseudo-aligned, there must exist some node $X$ such that $X$ is an ancestor of both $O_{\text{mesa}}$ and $O_{\text{base}}$ in the training distribution, and such that $O_{\text{mesa}}$ and $O_{\text{base}}$ increase with $X$. If $X = O_{\text{mesa}}$, this is side-effect alignment, and if $X = O_{\text{base}}$, this is instrumental alignment.
This represents the most generalized form of a relationship between $O_{\text{mesa}}$ and $O_{\text{base}}$ that can contribute to pseudo-alignment. Specifically, consider the causal graph given in figure 3.1. A mesa-optimizer with mesa-objective $O_{\text{mesa}}$ will decide to optimize $X$ as an instrumental goal of optimizing $O_{\text{mesa}}$, since $X$ increases $O_{\text{mesa}}$. This will then result in $O_{\text{base}}$ increasing, since optimizing for $X$ has the side-effect of increasing $O_{\text{base}}$. Thus, in the general case, side-effect and instrumental alignment can work together to contribute to pseudo-alignment over the training distribution, which is the general case of proxy alignment.
Figure 3.1. A causal diagram of the training environment for the different types of proxy alignment. From top to bottom, the diagrams represent side-effect alignment, instrumental alignment, and general proxy alignment. The arrows represent positive causal relationships, that is, cases where an increase in the parent causes an increase in the child.
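The causal-graph classification described above can be sketched in a few lines. This is only a toy under stated assumptions: the graph is a plain dictionary of parent-to-children edges, every arrow is taken to be a positive causal relationship, each node is allowed to count as its own "ancestor" so that the shared node can coincide with either objective, and the node names (`O_mesa`, `O_base`, `X`) and function names are hypothetical labels chosen for this example.

```python
def ancestors(graph, node):
    """Return every node that has a directed path to `node`."""
    parents = {}
    for parent, children in graph.items():
        for child in children:
            parents.setdefault(child, set()).add(parent)
    seen = set()
    stack = list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def classify(graph, mesa="O_mesa", base="O_base"):
    """Classify the proxy relationship between the two objective nodes."""
    # Include each objective itself so the shared node may equal O_mesa or O_base.
    common = (ancestors(graph, mesa) | {mesa}) & (ancestors(graph, base) | {base})
    if not common:
        return "no proxy alignment"
    if mesa in common:
        return "side-effect alignment"      # O_mesa causes O_base
    if base in common:
        return "instrumental alignment"     # O_base causes O_mesa
    return "general proxy alignment"        # some other node causes both


side_effect = {"O_mesa": ["O_base"]}
instrumental = {"O_base": ["O_mesa"]}
general = {"X": ["O_mesa", "O_base"]}

for name, graph in [("top", side_effect), ("middle", instrumental), ("bottom", general)]:
    print(name, "->", classify(graph))
```

The three example graphs mirror the three diagrams in the figure caption. When a graph matches more than one pattern, this sketch reports the first match, which is a simplification.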
Approximate alignment. A mesa-optimizer is approximately aligned if the mesa-objective and the base objective are approximately the same function up to some degree of approximation error related to the fact that the mesa-objective has to be represented inside the mesa-optimizer rather than being directly programmed by humans. For example, suppose you task a neural network with optimizing for some base objective that is impossible to perfectly represent in the neural network itself. Even if you get a mesa-optimizer that is as aligned as possible, it still will not be perfectly robustly aligned in this scenario, since there will have to be some degree of approximation error between its internal representation of the base objective and the actual base objective.
Suboptimality alignment. A mesa-optimizer is suboptimality aligned if some deficiency, error, or limitation in its optimization process causes it to exhibit aligned behavior on the training distribution. This could be due to computational constraints, unsound reasoning, a lack of information, irrational decision procedures, or any other defect in the mesa-optimizer’s reasoning process. Importantly, we are not referring to a situation where the mesa-optimizer is robustly aligned but nonetheless makes mistakes leading to bad outcomes on the base objective. Rather, suboptimality alignment refers to the situation where the mesa-optimizer is misaligned but nevertheless performs well on the base objective, precisely because it has been selected to make mistakes that lead to good outcomes on the base objective.
For an example of suboptimality alignment, consider a cleaning robot with a mesa-objective of minimizing the total amount of stuff in existence. If this robot has the mistaken belief that the dirt it cleans is completely destroyed, then it may be useful for cleaning the room despite doing so not actually helping it succeed at its objective. This robot will be observed to be a good optimizer of $O_{\text{base}}$ and hence be given a good score by the base optimizer. However, if during deployment the robot is able to improve its world model, it will stop exhibiting the desired behavior.
As another, perhaps more realistic example of suboptimality alignment, consider a mesa-optimizer with a mesa-objective $O_{\text{mesa}}$ and an environment in which there is one simple strategy and one complicated strategy for achieving $O_{\text{mesa}}$. It could be that the simple strategy is aligned with the base optimizer, but the complicated strategy is not. The mesa-optimizer might then initially only be aware of the simple strategy, and thus be suboptimality aligned, until it has been run for long enough to come up with the complicated strategy, at which point it stops exhibiting the desired behavior.
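The simple-versus-complicated strategy example can be made concrete with a toy model. The sketch assumes that each strategy has a fixed "search cost" and that the mesa-optimizer only knows about strategies within its current search budget. The `Strategy` class, the numbers, and the budget mechanism are hypothetical illustrations, not part of the original argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    search_cost: int    # planning effort needed before the strategy is discovered
    mesa_value: float   # how well it scores on the mesa-objective
    base_value: float   # how well it scores on the base objective


STRATEGIES = [
    Strategy("simple", search_cost=1, mesa_value=1.0, base_value=1.0),
    Strategy("complicated", search_cost=10, mesa_value=5.0, base_value=-1.0),
]


def best_known(strategies, budget):
    """Pick the strategy with the highest mesa-value among those already discovered."""
    known = [s for s in strategies if s.search_cost <= budget]
    return max(known, key=lambda s: s.mesa_value)


early = best_known(STRATEGIES, budget=1)    # limited search: only "simple" is known
late = best_known(STRATEGIES, budget=10)    # more search: "complicated" is found

print(early.name, early.base_value)
print(late.name, late.base_value)
```

The mesa-objective never changes between the two calls. Only the amount of search does, which is what distinguishes suboptimality alignment from a change in goals.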
3.2. The task
As in the second post, we will now consider the task the machine learning system is trained on. Specifically, we will address how the task affects a machine learning system’s propensity to produce pseudo-aligned mesa-optimizers.
Unidentifiability. It is a common problem in machine learning for a dataset to not contain enough information to adequately pinpoint a specific concept. This is closely analogous to the reason that machine learning models can fail to generalize or be susceptible to adversarial examples(19)—there are many more ways of classifying data that do well in training than any specific way the programmers had in mind. In the context of mesa-optimization, this manifests as pseudo-alignment being more likely to occur when a training environment does not contain enough information to distinguish between a wide variety of different objective functions. In such a case there will be many more ways for a mesa-optimizer to be pseudo-aligned than robustly aligned—one for each indistinguishable objective function. Thus, most mesa-optimizers that do well on the base objective will be pseudo-aligned rather than robustly aligned. This is a critical concern because it makes every other problem of pseudo-alignment worse—it is a reason that, in general, it is hard to find robustly aligned mesa-optimizers. Unidentifiability in mesa-optimization is partially analogous to the problem of unidentifiability in reward learning, in that the central issue is identifying the “correct” objective function given particular training data.(20) We will discuss this relationship further in the fifth post.
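A small counting exercise illustrates the claim that pseudo-aligned objectives can outnumber robustly aligned ones. The sketch assumes a deliberately tiny world: situations are 3-bit inputs, an objective is any function from situations to 0 or 1, and the training data covers only five of the eight situations. The choice of `base_objective` and of which situations appear in training is hypothetical.

```python
from itertools import product

# All 8 possible situations in the toy world.
inputs = list(product([0, 1], repeat=3))


def base_objective(x):
    """Hypothetical base objective: reward depends only on the first bit."""
    return x[0]


# Assume training only ever shows the first five situations.
train_inputs = inputs[:5]

consistent = 0   # objectives that match the base objective on the training data
robust = 0       # objectives that match it in every situation

# Enumerate every possible objective as a lookup table over the 8 situations.
for outputs in product([0, 1], repeat=len(inputs)):
    table = dict(zip(inputs, outputs))
    if all(table[x] == base_objective(x) for x in train_inputs):
        consistent += 1
        if all(table[x] == base_objective(x) for x in inputs):
            robust += 1

pseudo = consistent - robust
print(consistent, robust, pseudo)
```

Each situation missing from training doubles the number of objectives that look identical on the training data, while the number of robustly aligned objectives stays at one.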
In the context of mesa-optimization, there is also an additional source of unidentifiability stemming from the fact that the mesa-optimizer is selected merely on the basis of its output. Consider the following toy reinforcement learning example. Suppose that in the training environment, pressing a button always causes a lamp to turn on with a ten-second delay, and that there is no other way to turn on the lamp. If the base objective depends only on whether the lamp is turned on, then a mesa-optimizer that maximizes button presses and one that maximizes lamp light will show identical behavior, as they will both press the button as often as they can. Thus, we cannot distinguish these two mesa-optimizers by their behavior in this training environment. Nevertheless, the training environment does contain enough information to distinguish at least between these two particular objectives: since the high reward only comes after the ten-second delay, it must be from the lamp, not the button. As such, even if a training environment in principle contains enough information to identify the base objective, it might still be impossible to distinguish robustly aligned from proxy-aligned mesa-optimizers.
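The button-and-lamp example can be sketched as a toy environment. Several things here are assumptions added for illustration: the ten-second delay is ignored, each agent greedily picks the single action that scores highest on its own objective, and the deployment environment (a disconnected button plus a separate switch) is a hypothetical extension that does not appear in the example above.

```python
# Each environment maps an action to its effect on each candidate objective.
TRAIN_ENV = {
    "press_button": {"button_presses": 1, "lamp_light": 1},
    "wait": {"button_presses": 0, "lamp_light": 0},
}

# Hypothetical deployment: the button is disconnected and a switch controls the lamp.
DEPLOY_ENV = {
    "press_button": {"button_presses": 1, "lamp_light": 0},
    "flip_switch": {"button_presses": 0, "lamp_light": 1},
    "wait": {"button_presses": 0, "lamp_light": 0},
}


def act(env, objective):
    """Greedy mesa-optimizer: choose the action that maximizes its own objective."""
    return max(env, key=lambda action: env[action][objective])


def base_score(env, action):
    """The base objective only cares about whether the lamp is lit."""
    return env[action]["lamp_light"]


for env_name, env in [("train", TRAIN_ENV), ("deploy", DEPLOY_ENV)]:
    for objective in ["button_presses", "lamp_light"]:
        action = act(env, objective)
        print(env_name, objective, action, base_score(env, action))
```

In training, both agents choose the same action and receive the same base score, so output-based selection has nothing to separate them on. They only come apart once the causal link between button and lamp changes.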
Proxy choice as pre-computation. Proxy alignment can be seen as a form of pre-computation by the base optimizer. Proxy alignment allows the base optimizer to save the mesa-optimizer computational work by pre-computing which proxies are valuable for the base objective and then letting the mesa-optimizer maximize those proxies.
Without such pre-computation, the mesa-optimizer has to infer at runtime the causal relationship between different input features and the base objective, which might require significant computational work. Moreover, errors in this inference could result in outputs that perform worse on the base objective than if the system had access to pre-computed proxies. If the base optimizer precomputes some of these causal relationships—by selecting the mesa-objective to include good proxies—more computation at runtime can be diverted to making better plans instead of inferring these relationships.
The case of biological evolution may illustrate this point. The proxies that humans care about—food, resources, community, mating, etc.—are relatively computationally easy to optimize directly, while correlating well with survival and reproduction in our ancestral environment. For a human to be robustly aligned with evolution would have required us to instead care directly about spreading our genes, in which case we would have to infer that eating, cooperating with others, preventing physical pain, etc. would promote genetic fitness in the long run, which is not a trivial task. To infer all of those proxies from the information available to early humans would have required greater (perhaps unfeasibly greater) computational resources than to simply optimize for them directly. As an extreme illustration, for a child in this alternate universe to figure out not to stub its toe, it would have to realize that doing so would slightly diminish its chances of reproducing twenty years later.
For pre-computation to be beneficial, there needs to be a relatively stable causal relationship between a proxy variable and the base objective such that optimizing for the proxy will consistently do well on the base objective. However, even an imperfect relationship might give a significant performance boost over robust alignment if it frees up the mesa-optimizer to put significantly more computational effort into optimizing its output. This analysis suggests that there might be pressure towards proxy alignment in complex training environments, since the more complex the environment, the more computational work pre-computation saves the mesa-optimizer. Additionally, the more complex the environment, the more potential proxy variables are available for the mesa-optimizer to use.
Furthermore, in the context of machine learning, this analysis suggests that a time complexity penalty (as opposed to a description length penalty) is a double-edged sword. In the second post, we suggested that penalizing time complexity might serve to reduce the likelihood of mesa-optimization. However, the above suggests that doing so would also promote pseudo-alignment in those cases where mesa-optimizers do arise. If the cost of fully modeling the base objective in the mesa-optimizer is large, then a pseudo-aligned mesa-optimizer might be preferred simply because it reduces time complexity, even if it would underperform a robustly aligned mesa-optimizer without such a penalty.
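The trade-off can be illustrated with a toy selection rule. The sketch assumes the base optimizer scores each candidate as training performance minus a penalty proportional to runtime. The linear penalty form, the candidate names, and all numbers are hypothetical.

```python
# Hypothetical candidates: the robustly aligned one models the base objective
# in full, so it performs slightly better but takes longer to run.
CANDIDATES = [
    {"name": "robustly aligned", "performance": 1.00, "runtime": 10.0},
    {"name": "proxy aligned", "performance": 0.95, "runtime": 2.0},
]


def select(candidates, time_penalty):
    """Return the candidate with the best penalized score."""
    return max(
        candidates,
        key=lambda c: c["performance"] - time_penalty * c["runtime"],
    )


no_penalty = select(CANDIDATES, time_penalty=0.0)
with_penalty = select(CANDIDATES, time_penalty=0.01)

print(no_penalty["name"])
print(with_penalty["name"])
```

With these numbers the choice flips once the penalty exceeds 0.05 / 8 = 0.00625 per unit of runtime, which is the performance gap divided by the runtime gap.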
Compression of the mesa-optimizer. The description length of a robustly aligned mesa-optimizer may be greater than that of a pseudo-aligned mesa-optimizer. Since there are more pseudo-aligned mesa-objectives than robustly aligned mesa-objectives, pseudo-alignment provides more degrees of freedom for choosing a particularly simple mesa-objective. Thus, we expect that in most cases there will be several pseudo-aligned mesa-optimizers that are less complex than any robustly aligned mesa-optimizer.
This description cost is especially high if the learned algorithm’s input data does not contain easy-to-infer information about how to optimize for the base objective. Biological evolution seems to differ from machine learning in this sense, since evolution’s specification of the brain has to go through the information funnel of DNA. The sensory data that early humans received didn’t allow them to infer the existence of DNA, nor the relationship between their actions and their genetic fitness. Therefore, for humans to have been aligned with evolution would have required them to have an innately specified model of DNA, as well as the various factors influencing their inclusive genetic fitness. Such a model would not have been able to make use of environmental information for compression, and thus would have required a greater description length. In contrast, our models of food, pain, etc. can be very short since they are directly related to our input data.
3.3. The base optimizer
We now turn to how the base optimizer is likely to affect the propensity for a machine learning system to produce pseudo-aligned mesa-optimizers.
Hard-coded optimization. In the second post, we suggested that hard-coding an optimization algorithm—that is to say, choosing a model with built-in optimization—could be used to remove some of the incentives for mesa-optimization. Similarly, hard-coded optimization may be used to prevent some of the sources of pseudo-alignment, since it may allow one to directly specify or train the mesa-objective. Reward-predictive model-based reinforcement learning might be one possible way of accomplishing this.(21) For example, an ML system could include a model directly trained to predict the base objective together with a powerful hard-coded optimization algorithm. Doing this bypasses some of the problems of pseudo-alignment: if the mesa-optimizer is trained to directly predict the base reward, then it will be selected to make good predictions even if a bad prediction would result in a good policy. However, a learned model of the base objective will still be underdetermined off-distribution, so this approach by itself does not guarantee robust alignment.
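As a minimal sketch of this architecture, the code below pairs a learned reward predictor with a hard-coded exhaustive search. It rests on several assumptions: a nearest-neighbour lookup stands in for whatever trained model one would actually use, actions are plain numbers, and `true_reward` is a hypothetical base objective. None of this is taken from the cited work.

```python
def true_reward(action):
    """Hypothetical base objective, peaked at action == 3."""
    return -(action - 3) ** 2


def fit_reward_model(samples):
    """Stand-in for a trained reward predictor: nearest-neighbour lookup."""
    def predict(action):
        nearest_action, nearest_reward = min(samples, key=lambda s: abs(s[0] - action))
        return nearest_reward
    return predict


def plan(reward_model, actions):
    """Hard-coded optimizer: exhaustive argmax over the predicted reward."""
    return max(actions, key=reward_model)


# The model is trained only on actions 0..4.
train_samples = [(a, true_reward(a)) for a in range(5)]
reward_model = fit_reward_model(train_samples)

best_action = plan(reward_model, range(5))
print(best_action)

# Far from the training data, the prediction is an extrapolation that the
# training signal never constrained.
print(reward_model(50), true_reward(50))
```

The objective here is trained directly on reward predictions rather than being selected through policy behaviour, but the last line shows why the learned model remains underdetermined away from the training data.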
Algorithmic range. We hypothesize that a model’s algorithmic range will have implications for how likely it is to develop pseudo-alignment. One possible source of pseudo-alignment that could be particularly difficult to avoid is approximation error—if a mesa-optimizer is not capable of faithfully representing the base objective, then it can’t possibly be robustly aligned, only approximately aligned. Even if a mesa-optimizer might theoretically be able to perfectly capture the base objective, the more difficult that is for it to do, the more we might expect it to be approximately aligned rather than robustly aligned. Thus, a large algorithmic range may be both a blessing and a curse: it makes it less likely that mesa-optimizers will be approximately aligned, but it also increases the likelihood of getting a mesa-optimizer in the first place.[1]
Subprocess interdependence. There are some reasons to believe that there might be more initial optimization pressure towards proxy aligned than robustly aligned mesa-optimizers. In a local optimization process, each parameter of the learned algorithm (e.g. the parameter vector of a neuron) is adjusted to locally improve the base objective conditional on the other parameters. Thus, the benefit for the base optimizer of developing a new subprocess will likely depend on what other subprocesses the learned algorithm currently implements. Therefore, even if some subprocess would be very beneficial if combined with many other subprocesses, the base optimizer may not select for it until the subprocesses it depends on are sufficiently developed. As a result, a local optimization process would likely result in subprocesses that have fewer dependencies being developed before those with more dependencies.
In the context of mesa-optimization, the benefit of a robustly aligned mesa-objective seems to depend on more subprocesses than at least some pseudo-aligned mesa-objectives. For example, consider a side-effect aligned mesa-optimizer optimizing for some set of proxy variables. Suppose that it needs to run some subprocess to model the relationship between its actions and those proxy variables. If we assume that optimizing the proxy variables is necessary to perform well on the base objective, then for a mesa-optimizer to be robustly aligned, it would also need to model the causal relationship between those proxy variables and the base objective, which might require additional subprocesses. Moreover, the benefit to the base optimizer of adding those subprocesses depends on the mesa-optimizer having additional subprocesses to model the relationship between its actions and those proxy variables. This informal argument suggests that if a mesa-optimizer’s computation neatly factors in this way, then developing a robustly aligned mesa-objective may require strictly more subprocesses than developing a pseudo-aligned mesa-objective.
This suggests that, at least in a local optimization process, mesa-optimizers might tend to start their development as proxy aligned before becoming robustly aligned. In other words, rather than simultaneously gaining competence and becoming aligned, we might expect such a system to first become competent at optimizing proxies, then possibly start becoming more robustly aligned.
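The dependency argument can be sketched as a staged build order. The sketch assumes that a subprocess can only be developed once everything it depends on exists, and that all currently buildable subprocesses appear in the same stage. The subprocess names and the dependency table are hypothetical, chosen to match the factoring described in the informal argument.

```python
# Hypothetical dependency table: each subprocess lists what must exist first.
DEPENDS_ON = {
    "model_actions_to_proxy": set(),
    "proxy_objective": {"model_actions_to_proxy"},
    "model_proxy_to_base": {"model_actions_to_proxy"},
    "robust_objective": {"model_actions_to_proxy", "model_proxy_to_base"},
}


def development_stages(depends_on):
    """Group subprocesses into stages; each stage needs only earlier stages."""
    built = set()
    stages = []
    while len(built) < len(depends_on):
        ready = sorted(
            name for name, deps in depends_on.items()
            if name not in built and deps <= built
        )
        if not ready:
            raise ValueError("dependency cycle: nothing can be developed next")
        stages.append(ready)
        built.update(ready)
    return stages


def stage_of(stages, name):
    """Index of the stage in which a subprocess first appears."""
    return next(i for i, stage in enumerate(stages) if name in stage)


stages = development_stages(DEPENDS_ON)
for index, stage in enumerate(stages):
    print(index, stage)
```

Under this factoring, the proxy objective becomes useful one stage before the robust objective does, which matches the suggested ordering of competence before robust alignment.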
Internals-based selection. In current machine learning systems, learned algorithms are selected primarily based on how their policies perform on the base objective. There is often some selection based on the internal structure of the learned algorithm, such as weight regularization, but such selection is usually very coarse. One possible idea for addressing pseudo-alignment could be to use a selection method based on a more sophisticated function of a learned algorithm’s internal workings. This approach is similar in spirit to hard-coded optimization: we use knowledge of the system’s internal workings to prevent it from being proxy-aligned.
One possibility might be to develop a meta-learner that is specifically trained to prevent pseudo-alignment. While this seems beyond current meta-learning techniques, the basic principle is similar to existing meta-learners designed to optimize for robustness. An extreme form of internals-based selection might have the base optimizer reason about or prove properties of the learned algorithm as part of its selection process.
While desirable, such methods are currently out of reach primarily due to the limitations of current transparency methods. One possible concern with all of these approaches, however, is that they might require the base optimizer to be better at searching for problems than the mesa-optimizer is at hiding them. Furthermore, in order to create a base optimizer that can do this, we may need to first train it using a third optimizer, which might result in the same problem at this second layer.
Adversarial training. One possible method to alleviate some of these concerns might be to sample environments adversarially from a very wide distribution. With such adversarial training, it might be possible to overcome these hurdles and make the robustly aligned objective the most reachable one. This would require making it so that, for every mesa-objective $O_{\text{mesa}}$ that is more reachable than the base objective $O_{\text{base}}$, there is some situation where $O_{\text{mesa}}$ and $O_{\text{base}}$ disagree and the mesa-optimizer is incentivized to agree with $O_{\text{base}}$.
If we model reachability of an objective as simply its length in bits, then distinguishing $O_{\text{base}}$ from every single more reachable $O_{\text{mesa}}$ gets exponentially harder as $O_{\text{base}}$ gets more complex. Thus, for a very complicated $O_{\text{base}}$, sufficiently incentivizing the base optimizer to find a mesa-optimizer with that $O_{\text{base}}$ is likely to be very difficult, though not impossible.
Even in such a case, however, there would likely still be a period of time where the learned algorithm is a misaligned mesa-optimizer, leaving open an ominous possibility: the misaligned mesa-optimizer could figure out the correct actions to take based on $O_{\text{base}}$ while its objective function was still $O_{\text{mesa}}$. We will call this situation deceptive alignment and will discuss it at greater length in the next post.
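One way to see where the exponential growth comes from is a simple count. As an added assumption for illustration, suppose each objective is encoded by a distinct bit string and that shorter strings are more reachable. If the base objective has a description length of $n$ bits, the number of strictly shorter bit strings, and hence an upper bound on the number of more reachable objectives, is

$$\sum_{k=0}^{n-1} 2^k = 2^n - 1.$$

For $n = 10$ this bound is $1{,}023$, and for $n = 20$ it is $1{,}048{,}575$. Each additional bit of description length roughly doubles the number of candidate objectives that training may need to rule out.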
The fourth post in the Risks from Learned Optimization Sequence, titled “Deceptive Alignment,” can be found here.
[1] Though a large algorithmic range seems to make approximate alignment less likely, it is unclear how it might affect other forms of pseudo-alignment such as deceptive alignment.
It seems like the general pattern here is that, when using machine learning for some task X, there are a bunch of properties that affect the likelihood of learning heuristics or proxies rather than actually learning the optimal thing for X. For any such property, making heuristics/proxies more likely would result in a lower chance of mesa-optimization (since optimizers are less like heuristics/proxies) but conditional on mesa-optimization arising, makes it more likely that it is a pseudo-aligned mesa-optimizer instead of a robustly-aligned mesa-optimizer (because now the pressure for heuristics/proxies leads to learning a proxy mesa-objective instead of the true base objective). Example properties of this form are algorithmic range, simplicity bias, and time complexity penalties. Does that seem right?
This is backwards, I think?
I agree with that as a general takeaway, though I would caution that I don’t think it’s always true—for example, hard-coded optimization seems to help in both cases, and I suspect algorithmic range to be more complicated than that, likely making some pseudo-alignment problems better but also possibly making some worse.
Also, yeah, that was backwards—it should be fixed now.
Thanks for the interesting paper. I feel that the risks described are entirely plausible.
What is valuable for me in particular is that the paper re-casts many alignment risks that have already been discussed in a programmer-agent context into a new ‘inner alignment’ context. To quote the key description and separation of concerns:
That being said, I sometimes have trouble understanding how the paper defines, or does not define, the time-based relation between the base optimizer and the mesa-optimizer. I started out with a mental model where there is a one-time ‘batch’ creation operation in which the base optimizer creates the mesa-optimizer (or rather the agent which might contain a mesa-optimizer) by using simulations over a training set to compare the performance of candidate agents. The agent that scores best on the base objective is then run in the real world. However, some of Evan’s comments on mesa-optimization lead me to believe that there is sometimes a more real-time continuous adjustment relation between the base optimizer and the agent that is created. I am unclear on whether this would create additional problems, or block certain solutions.
The base-to-mesa fidelity loss problem is similar to the problem where there is a loss of fidelity between a) what the programmers actually want and b) what they encode into the base objective. However, when considering fidelity loss between b) the base objective and c) the mesa objective, I feel there is an important extra dimension. Unlike the objectives a), the objective function b) is by nature computable: it has to be computable or else the base optimizer cannot use it to select between candidates. But if the base objective function is computable at mesa-optimizer design time, it should typically also be computable at mesa-optimizer run time.
Say that the mesa optimizer is trained to control a self-driving car, or a racing car in a video game. Then while the mesa-optimizer is driving, it should be possible to evaluate the quality of the driving by using the base objective function. Whenever the base objective function shows a very low value, a safety protocol can kick in, e.g. to stop the car. The threshold of ‘very low value’ can be calibrated using the values computed over the training set at design time.
(I can imagine some special cases where the base objective function is not computable while the mesa-agent runs, e.g. if the base objective function was created by hand-labeling all instances of the training set. But for many economically relevant scenarios, especially for agents that need to be good at ‘planning’, good at optimizing sequences of actions that work towards a goal, I expect that the base objective will be perfectly computable in the real world.)
So overall, while I appreciate that the paper identifies and highlights inner alignment risks, my feeling is that the analysis provided is implicitly too pessimistic about the inner alignment problem. It seems to me that some very plausible and interesting risk mitigation options, options that leverage the availability of a computable base objective function, are not being identified. The obvious statement applies: future work to chart these options would be most welcome.
The simple reason why that doesn’t work is that computing the base objective function during deployment will likely require something like actually running the model, letting it act in the real world, and then checking how good that action was. But in the case of actions that are catastrophically bad, the fact that you can check after the fact that the action was in fact bad isn’t much consolation: just having the agent take the action in the first place could be exceptionally dangerous during deployment.
I believe you might have been thinking in your reply above about a sub-set of all possible base objective functions: functions that compute a single ‘pass/fail’ value at a natural end of the training run, e.g. ‘the car never crashed’ or ‘all parts of the floor have been swept’. I was thinking of incrementally scoring objective functions, basically functions that sum utility increments achieved over time. So at any time during a run you can measure and compute the base objective function score up to that time. Monitoring this score should allow you to detect many forms of non-alignment between the base objective and the mesa objective automatically.
As mentioned, I see this as a promising technique for risk mitigation; it is not supposed to be a watertight way to eliminate all risks. The technique considered only looks at the score achieved so far. It does not run models to extrapolate and score the long-term consequences of every action: this would indeed be difficult. While observed past good performance does not guarantee future good performance, an observation of past bad performance does give you a very useful safety signal.
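A minimal sketch of the monitoring idea described in this comment, assuming an incrementally scored base objective. The function names `calibrate_threshold` and `monitor_run`, the use of a low quantile of training scores as the cutoff, and the sliding window are all illustrative assumptions, not something specified in the paper or in the comment.

```python
from typing import Iterable, List, Optional


def calibrate_threshold(training_scores: List[float], quantile: float = 0.0) -> float:
    """Pick a 'very low value' cutoff from per-step scores seen at design time.

    Assumption: a low empirical quantile of the training scores is a
    reasonable cutoff. quantile=0.0 returns the worst training score.
    """
    if not training_scores:
        raise ValueError("need at least one training score")
    ordered = sorted(training_scores)
    index = int(quantile * (len(ordered) - 1))
    return ordered[index]


def monitor_run(step_scores: Iterable[float], threshold: float, window: int = 3) -> Optional[int]:
    """Return the step at which the safety protocol would trigger, or None.

    The (hypothetical) protocol triggers when the mean of the last `window`
    utility increments drops below `threshold`.
    """
    recent: List[float] = []
    for step, score in enumerate(step_scores):
        recent.append(score)
        if len(recent) > window:
            recent.pop(0)  # keep only the most recent `window` increments
        if len(recent) == window and sum(recent) / window < threshold:
            return step
    return None


threshold = calibrate_threshold([0.8, 0.9, 1.0, 1.1, 1.2])
trigger_step = monitor_run([1.0, 1.0, 1.0, 0.1, 0.1, 0.1], threshold)
```

Note that the monitor only sees a score after the corresponding action has been taken, so it can flag bad performance but cannot prevent the first bad action.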
Isn’t IDA meant to be a solution to this problem? Do you discuss IDA anywhere, maybe in the last post?
IDA is definitely a good candidate to solve problems of this form. I think IDA’s best properties are primarily outer alignment properties, but it does also have some good properties with respect to inner alignment such as allowing you to bootstrap an informed adversary by giving it access to your question-answer system as you’re training it. That being said, I suspect you could do something similar under a wide variety of different systems—bootstrapping an informed adversary is not necessarily unique to IDA. Unfortunately, we don’t discuss IDA much in the last post, though thinking about mesa-optimizers in IDA (and other proposals e.g. debate) is imo a very important goal, and our hope is to at the very least provide the tools so that we can then go and start answering questions of that form.
Thanks! In addition to wanting your take on IDA as a potential solution to inner alignment, I also brought up my question because that place seemed like a natural place to mention/cite IDA and related ideas, and by not doing that you could give the mistaken impression that nobody has proposed a good enough candidate solution to be worth mentioning. But it might be fine if you do at least mention it in the conclusions or somewhere else.
I just added a footnote mentioning IDA to this section of the paper, though I’m leaving it as is in the sequence to avoid messing up the bibliography numbering.
This isn’t obvious to me. If the environment is fairly varied, you will probably need different proxies for the base objective in different situations. As you say, representing all these proxies directly will save on computation time, but I would expect it to have a longer description length, since each proxy needs to be specified independently (together with information on how to make tradeoffs between them). The opposite case, where a complex base objective correlates with the same proxy in a wide range of environments, seems rarer.
Using humans as an analogy, we were specified with proxy goals, and our values are extremely complicated. You mention the sensory experience of food and pain as relatively simple goals, but we also have far more complex ones, like the wish to be relatively high in a status hierarchy, the wish to not have a mate cheat on us, etc. You’re right that an innate model of genetic fitness also would have been quite complicated, though.
(Rohin mentions that most of these things follow a pattern where one extreme encourages heuristics and one extreme encourages robust mesa-optimizers, while you get pseudo-aligned mesa-optimizers in the middle. At present, simplicity breaks this pattern, since you claim that pseudo-aligned mesa-optimizers are simpler than both heuristics and robustly aligned mesa-optimizers. What I’m saying is that I think that the general pattern might hold here, as well: short description lengths might make it easier to achieve robust alignment.)
Edit: To some extent, it seems like you already agree with this, since “Adversarial training” points out that a sufficiently wide range of environments will have a robustly aligned agent as its simplest mesa-optimizer. Do you assume that there isn’t enough training data to identify O_base in “Compression of the mesa-optimizer”? It might be good to clarify the difference between those two sections.
I think it’s very rarely going to be the case that the simplest possible mesa-objective that produces good behavior on the training data will be the base objective. Intuitively, we might hope that, since we are judging the mesa-optimizer based on the base objective, the simplest way to achieve good behavior will just be to optimize for the base objective. But importantly, you only ever test the base objective over some finite training distribution. Off-distribution, the mesa-objective can do whatever it wants. Expecting the mesa-objective to exactly mirror the base objective even off-distribution where the correspondence was never tested seems very problematic. It must be the case that precisely the base objective is the unique simplest objective that fits all the data points, which, given the massive space of all possible objectives, seems unlikely, even for very large training datasets. Furthermore, the base and mesa- optimizers are operating under different criteria for simplicity: as you mention, food, pain, mating, etc. are pretty simple to humans, because they get to refer to sensory data, but very complex from the perspective of evolution, which doesn’t.
That being said, you might be able to get pretty close, even if you don’t hit the base objective exactly, though exactly how close is very unclear, especially once you start considering other factors like computational complexity as you mention.
More generally, I think the broader point here is just that there are a lot of possible pseudo-aligned mesa-objectives: the space of possible objectives is very large, and the actual base objective occupies only a tiny fraction of that space. Thus, to the extent that you are optimizing for anything other than pure similarity to the base objective, you’re likely to find an optimum which isn’t exactly the base objective, just simply because there are so many different possible objectives for you to find, and it’s likely that one of them will gain more from increased simplicity (or anything else) than it loses by being farther away from the base objective.
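To make the counting intuition concrete, here is a toy sketch. It assumes objectives are arbitrary boolean functions over 3-bit inputs, with no simplicity prior at all; `base_objective` and `count_consistent_objectives` are hypothetical names chosen for illustration. The sketch counts how many candidate objectives agree with the base objective on every training input.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Bits = Tuple[int, ...]


def base_objective(x: Bits) -> int:
    """Hypothetical base objective: XOR of the first and last bit."""
    return x[0] ^ x[-1]


def count_consistent_objectives(
    n_bits: int,
    base: Callable[[Bits], int],
    train_inputs: Sequence[Bits],
) -> int:
    """Count boolean functions on n_bits inputs that match `base` on the training inputs."""
    inputs = list(product([0, 1], repeat=n_bits))
    count = 0
    # Enumerate every possible truth table over the full input space.
    for table in product([0, 1], repeat=len(inputs)):
        candidate = dict(zip(inputs, table))
        if all(candidate[x] == base(x) for x in train_inputs):
            count += 1
    return count


all_inputs = list(product([0, 1], repeat=3))
# Five of the eight inputs are in the training distribution.
n_fit_partial = count_consistent_objectives(3, base_objective, all_inputs[:5])
# Every input is in the training distribution.
n_fit_full = count_consistent_objectives(3, base_objective, all_inputs)
```

With five of eight inputs observed, the three unobserved inputs are unconstrained, so 2^3 candidates fit the data and only one of them is the base objective. In this unstructured toy space, the base objective becomes the unique fit only once every input has been tested.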
Main point:
I agree that inner alignment is a really hard problem, and that for a non-huge amount of training data, there is likely to be a proxy goal that’s simpler than the real goal. Description length still seems importantly different from e.g. computation time. If we keep optimising for the simplest learned algorithm, and gradually increase our training data towards all of the data we care about, I expect us to eventually reach a mesa-optimiser optimising for the base objective. (You seem to agree with this, in the last section?) However, if we keep optimising for the fastest learned algorithm, and gradually increase our training data towards all of the data we care about, we won’t ever get a robustly aligned system (until we’ve shown it every single datapoint that we’ll ever care about). We’ll probably just get a look-up table which acts randomly on new input.
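The contrast between selecting for simplicity and selecting for speed can be sketched with a toy hypothesis class. Everything here is an assumption made for illustration: the divisibility rules, the made-up `description_length` and `runtime_cost` numbers, and the lookup table's default answer on unseen inputs.

```python
from typing import Callable, Dict, List, NamedTuple


class Candidate(NamedTuple):
    name: str
    description_length: int  # hypothetical complexity measure
    runtime_cost: int        # hypothetical per-query cost
    predict: Callable[[int], bool]


def base_objective(x: int) -> bool:
    """Hypothetical 'true' rule the base optimizer scores against."""
    return x % 6 == 0


def make_candidates(train: Dict[int, bool]) -> List[Candidate]:
    # Rules "x % k == 0": smaller k is treated as simpler (an assumption).
    rules = [
        Candidate(f"x % {k} == 0", k, 10, lambda x, k=k: x % k == 0)
        for k in range(1, 7)
    ]
    # A memorizing lookup table: cheap to run, long to describe,
    # and it answers with an arbitrary default (False) on unseen inputs.
    table = dict(train)
    lookup = Candidate("lookup table", 100 + len(table), 1,
                       lambda x: table.get(x, False))
    return rules + [lookup]


def select(train: Dict[int, bool], key: Callable[[Candidate], int]) -> Candidate:
    """Pick the candidate minimizing `key` among those that fit all training data."""
    fits = [c for c in make_candidates(train)
            if all(c.predict(x) == y for x, y in train.items())]
    return min(fits, key=key)


def simplest(train: Dict[int, bool]) -> str:
    return select(train, key=lambda c: c.description_length).name


def fastest(train: Dict[int, bool]) -> str:
    return select(train, key=lambda c: c.runtime_cost).name


def labelled(xs: List[int]) -> Dict[int, bool]:
    return {x: base_objective(x) for x in xs}
```

In this toy setup, `simplest` picks a proxy rule on small datasets and moves to the base rule as counterexamples are added, while `fastest` picks the lookup table no matter how much data is supplied.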
This difference makes me think that simplicity could be a useful tool to make a robustly aligned mesa optimiser. Maybe you disagree because you think that the necessary amount of data is so ludicrously big that we’ll never reach it, even by using adversarial training or other such tricks?
I’d be more willing to drop simplicity if we had good, generic methods to directly optimise for “pure similarity to the base objective”, but I don’t know how to do this without doing hard-coded optimisation or internals-based selection. Maybe you think the task is impossible without some version of the latter?
I broadly agree that description complexity penalties help fight against pseudo-alignment whereas computational complexity penalties make it more likely, though I don’t think it’s absolute and there are definitely a bunch of caveats to that statement. For example, Solomonoff Induction seems unsafe despite maximally selecting for description complexity, though obviously that’s not a physical example.
Minor point:
I chose status and cheating precisely because they don’t directly refer to simple sensory data. You need complex models of your social environment in order to even have a concept of status, and I actually think it’s pretty impressive that we have enough of such models hardcoded into us to have preferences over them.
Since the original text mentions food and pain as “directly related to our input data”, I thought status hierarchies were noticeably different from them in this way. Do tell me if you were trying to point at some other distinction (or if you don’t think status requires complex models).
I agree, status definitely seems more complicated—in that case it was just worth the extra complexity. The point, though, is just that the measure of complexity under which the mesa-objective is selected is different from more natural measures of complexity under which you might hope for the base objective to be the simplest. Thus, even though sometimes it is absolutely worth it to sacrifice simplicity, you shouldn’t usually expect that sacrifice to be in the direction of moving closer to the base objective.
What is the intuition that makes you think that, despite being exponentially harder, this would not be impossible?
On the one hand, this makes it sound like, instead of creating new ones (neurons? sets of neurons?), existing neurons are likely to be re-used. On the other hand, “One pixel attack for fooling deep neural networks” almost seems to ask “are subprocesses with lots of dependencies* ever made?”
*High(er) level processes.