Risks from Learned Optimization: Introduction
This is the first of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.
Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, and Joar Skalse contributed equally to this sequence. With special thanks to Paul Christiano, Eric Drexler, Rob Bensinger, Jan Leike, Rohin Shah, William Saunders, Buck Shlegeris, David Dalrymple, Abram Demski, Stuart Armstrong, Linda Linsefors, Carl Shulman, Toby Ord, Kate Woolverton, and everyone else who provided feedback on earlier versions of this sequence.
Motivation
The goal of this sequence is to analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization, a neologism we introduce in this sequence. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be—how will it differ from the loss function it was trained under—and how can it be aligned?
We believe that this sequence presents the most thorough analysis of these questions that has been conducted to date. In particular, we present not only an introduction to the basic concerns surrounding mesa-optimizers, but also an analysis of the particular aspects of an AI system that we believe are likely to make the problems related to mesa-optimization relatively easier or harder to solve. By providing a framework for understanding the degree to which different AI systems are likely to be robust to misaligned mesa-optimization, we hope to start a discussion about the best ways of structuring machine learning systems to solve these problems. Furthermore, in the fourth post we will provide what we think is the most detailed analysis yet of a problem we refer to as deceptive alignment, which we posit may present one of the largest—though not necessarily insurmountable—current obstacles to producing safe advanced machine learning systems using techniques similar to modern machine learning.
Two questions
In machine learning, we do not manually program each individual parameter of our models. Instead, we specify an objective function that captures what we want the system to do and a learning algorithm to optimize the system for that objective. In this post, we present a framework that distinguishes what a system is optimized to do (its “purpose”) from what it optimizes for (its “goal”), if it optimizes for anything at all. While all AI systems are optimized for something (have a purpose), whether they actually optimize for anything (pursue a goal) is non-trivial. We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system. Learning algorithms in machine learning are optimizers because they search through a space of possible parameters—e.g. neural network weights—and improve the parameters with respect to some objective. Planning algorithms are also optimizers, since they search through possible plans, picking those that do well according to some objective.
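For instance, a minimal, hypothetical sketch of the kind of explicit search this definition points at (Python, with all names purely illustrative) might look like:

```python
def optimize(candidates, objective):
    """Search a space of candidates for the element scoring highest on `objective`."""
    best, best_score = None, float("-inf")
    for candidate in candidates:
        score = objective(candidate)  # the objective is explicitly represented and queried
        if score > best_score:
            best, best_score = candidate, score
    return best

# A toy "planning" instance: the candidates are plans, scored by an objective.
plans = [("left", "left"), ("left", "right"), ("right", "right")]
best_plan = optimize(plans, objective=lambda plan: plan.count("right"))
print(best_plan)  # ('right', 'right')
```

A planning algorithm fits this template with plans as the candidates; a learning algorithm fits it with parameter settings as the candidates.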
Whether a system is an optimizer is a property of its internal structure—what algorithm it is physically implementing—and not a property of its input-output behavior. Importantly, the fact that a system’s behavior results in some objective being maximized does not make the system an optimizer. For example, a bottle cap causes water to be held inside the bottle, but it is not optimizing for that outcome since it is not running any sort of optimization algorithm.(1) Rather, bottle caps have been optimized to keep water in place. The optimizer in this situation is the human that designed the bottle cap by searching through the space of possible tools for one to successfully hold water in a bottle. Similarly, image-classifying neural networks are optimized to achieve low error in their classifications, but are not, in general, themselves performing optimization.
However, it is also possible for a neural network to itself run an optimization algorithm. For example, a neural network could run a planning algorithm that predicts the outcomes of potential plans and searches for those it predicts will result in some desired outcome.[1] Such a neural network would itself be an optimizer because it would be searching through the space of possible plans according to some objective function. If such a neural network were produced in training, there would be two optimizers: the learning algorithm that produced the neural network—which we will call the base optimizer—and the neural network itself—which we will call the mesa-optimizer.[2]
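As a deliberately tiny, hypothetical sketch of this two-optimizer structure (all names are illustrative, and random hill-climbing stands in for gradient descent): the base optimizer below searches over parameters, while the learned algorithm those parameters define itself searches over actions, scoring them with an objective encoded in its parameters.

```python
import random

ACTIONS = [-1.0, -0.5, 0.0, 0.5, 1.0]

def learned_algorithm(observation, params):
    """The learned model: internally searches over actions for the one it scores best."""
    def predicted_value(action):  # an objective encoded in the learned parameters
        return -(params[0] * observation + params[1] - action) ** 2
    return max(ACTIONS, key=predicted_value)

def base_objective(params, data):
    """How well the learned algorithm's outputs match the training targets."""
    return -sum((learned_algorithm(x, params) - y) ** 2 for x, y in data)

def base_optimizer(data, steps=2000):
    """The base optimizer: crude random hill-climbing over parameters."""
    params = [0.0, 0.0]
    for _ in range(steps):
        proposal = [p + random.gauss(0, 0.1) for p in params]
        if base_objective(proposal, data) >= base_objective(params, data):
            params = proposal
    return params

data = [(x, 0.5 * x) for x in range(-3, 4)]   # toy training data
trained_params = base_optimizer(data)
print(learned_algorithm(2.0, trained_params))  # the learned model acting at "deployment"
```

Nothing in the base optimizer's loop refers to the learned model's internal objective; it only selects parameters by their score on the base objective, which is why the two objectives can come apart.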
The possibility of mesa-optimizers has important implications for the safety of advanced machine learning systems. When a base optimizer generates a mesa-optimizer, safety properties of the base optimizer’s objective may not transfer to the mesa-optimizer. Thus, we explore two primary questions related to the safety of mesa-optimizers:
Mesa-optimization: Under what circumstances will learned algorithms be optimizers?
Inner alignment: When a learned algorithm is an optimizer, what will its objective be, and how can it be aligned?
Once we have introduced our framework in this post, we will address the first question in the second post, begin addressing the second question in the third post, and finally delve deeper into a specific aspect of the second question in the fourth post.
1.1. Base optimizers and mesa-optimizers
Conventionally, the base optimizer in a machine learning setup is some sort of gradient descent process with the goal of creating a model designed to accomplish some specific task.
Sometimes, this process will also involve some degree of meta-optimization wherein a meta-optimizer is tasked with producing a base optimizer that is itself good at optimizing systems to achieve particular goals. Specifically, we will think of a meta-optimizer as any system whose task is optimization. For example, we might design a meta-learning system to help tune our gradient descent process.(4) Though the model found by meta-optimization can be thought of as a kind of learned optimizer, it is not the form of learned optimization that we are interested in for this sequence. Rather, we are concerned with a different form of learned optimization which we call mesa-optimization.
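As a hedged illustration of meta-optimization (not an example from the paper; the setup and numbers are invented): a meta-optimizer whose task is itself optimization, here selecting the learning rate of a simple gradient-descent base optimizer by how well the resulting optimizer performs.

```python
def base_optimizer(learning_rate, steps=100):
    """Gradient descent on f(w) = (w - 3)^2; returns the final loss achieved."""
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3.0)
        w -= learning_rate * grad
    return (w - 3.0) ** 2

def meta_optimizer(candidate_learning_rates):
    """Search over base-optimizer configurations for the one that optimizes best."""
    return min(candidate_learning_rates, key=base_optimizer)

best_lr = meta_optimizer([0.001, 0.01, 0.1, 0.5])
print(best_lr)
```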
Mesa-optimization is a conceptual dual of meta-optimization—whereas meta is Greek for “after,” mesa is Greek for “within.”[3] Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) finds a model that is itself an optimizer, which we will call a mesa-optimizer. Unlike meta-optimization, in which the task itself is optimization, mesa-optimization is task-independent, and simply refers to any situation where the internal structure of the model ends up performing optimization because it is instrumentally useful for solving the given task.
In such a case, we will use base objective to refer to whatever criterion the base optimizer was using to select between different possible systems and mesa-objective to refer to whatever criterion the mesa-optimizer is using to select between different possible outputs. In reinforcement learning (RL), for example, the base objective is generally the expected return. Unlike the base objective, the mesa-objective is not specified directly by the programmers. Rather, the mesa-objective is simply whatever objective was found by the base optimizer that produced good performance on the training environment. Because the mesa-objective is not specified by the programmers, mesa-optimization opens up the possibility of a mismatch between the base and mesa- objectives, wherein the mesa-objective might seem to perform well on the training environment but lead to bad performance off the training environment. We will refer to this case as pseudo-alignment below.
There need not always be a mesa-objective since the algorithm found by the base optimizer will not always be performing optimization. Thus, in the general case, we will refer to the model generated by the base optimizer as a learned algorithm, which may or may not be a mesa-optimizer.
Figure 1.1. The relationship between the base and mesa- optimizers. The base optimizer optimizes the learned algorithm based on its performance on the base objective. In order to do so, the base optimizer may have turned this learned algorithm into a mesa-optimizer, in which case the mesa-optimizer itself runs an optimization algorithm based on its own mesa-objective. Regardless, it is the learned algorithm that directly takes actions based on its input.
Possible misunderstanding: “mesa-optimizer” does not mean “subsystem” or “subagent.” In the context of deep learning, a mesa-optimizer is simply a neural network that is implementing some optimization process and not some emergent subagent inside that neural network. Mesa-optimizers are simply a particular type of algorithm that the base optimizer might find to solve its task. Furthermore, we will generally be thinking of the base optimizer as a straightforward optimization algorithm, and not as an intelligent agent choosing to create a subagent.[4]
We distinguish the mesa-objective from a related notion that we term the behavioral objective. Informally, the behavioral objective is the objective which appears to be optimized by the system’s behavior. We can operationalize the behavioral objective as the objective recovered from perfect inverse reinforcement learning (IRL).[5] This is in contrast to the mesa-objective, which is the objective actively being used by the mesa-optimizer in its optimization algorithm.
Arguably, any possible system has a behavioral objective—including bricks and bottle caps. However, for non-optimizers, the appropriate behavioral objective might just be “1 if the actions taken are those that are in fact taken by the system and 0 otherwise,”[6] and it is thus neither interesting nor useful to know that the system is acting to optimize this objective. For example, the behavioral objective “optimized” by a bottle cap is the objective of behaving like a bottle cap.[7] However, if the system is an optimizer, then it is more likely that it will have a meaningful behavioral objective. That is, to the degree that a mesa-optimizer’s output is systematically selected to optimize its mesa-objective, its behavior may look more like coherent attempts to move the world in a particular direction.[8]
A given mesa-optimizer’s mesa-objective is determined entirely by its internal workings. Once training is finished and a learned algorithm is selected, its direct output—e.g. the actions taken by an RL agent—no longer depends on the base objective. Thus, it is the mesa-objective, not the base objective, that determines a mesa-optimizer’s behavioral objective. Of course, to the degree that the learned algorithm was selected on the basis of the base objective, its output will score well on the base objective. However, in the case of a distributional shift, we should expect a mesa-optimizer’s behavior to more robustly optimize for the mesa-objective since its behavior is directly computed according to it.
As an example to illustrate the base/mesa distinction in a different domain, and the possibility of misalignment between the base and mesa- objectives, consider biological evolution. To a first approximation, evolution selects organisms according to the objective function of their inclusive genetic fitness in some environment.[9] Most of these biological organisms—plants, for example—are not “trying” to achieve anything, but instead merely implement heuristics that have been pre-selected by evolution. However, some organisms, such as humans, have behavior that does not merely consist of such heuristics but is instead also the result of goal-directed optimization algorithms implemented in the brains of these organisms. Because of this, these organisms can perform behavior that is completely novel from the perspective of the evolutionary process, such as humans building computers.
However, humans tend not to place explicit value on evolution’s objective, at least in terms of caring about their alleles’ frequency in the population. The objective function stored in the human brain is not the same as the objective function of evolution. Thus, when humans display novel behavior optimized for their own objectives, they can perform very poorly according to evolution’s objective. Making a decision not to have children is a possible example of this. Therefore, we can think of evolution as a base optimizer that produced brains—mesa-optimizers—which then actually produce organisms’ behavior—behavior that is not necessarily aligned with evolution.
1.2. The inner and outer alignment problems
In “Scalable agent alignment via reward modeling,” Leike et al. describe the concept of the “reward-result gap” as the difference between the (in their case learned) “reward model” (what we call the base objective) and the “reward function that is recovered with perfect inverse reinforcement learning” (what we call the behavioral objective).(8) That is, the reward-result gap is the fact that there can be a difference between what a learned algorithm is observed to be doing and what the programmers want it to be doing.
The problem posed by misaligned mesa-optimizers is a kind of reward-result gap. Specifically, it is the gap between the base objective and the mesa-objective (which then causes a gap between the base objective and the behavioral objective). We will call the problem of eliminating the base-mesa objective gap the inner alignment problem, which we will contrast with the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers. This terminology is motivated by the fact that the inner alignment problem is an alignment problem entirely internal to the machine learning system, whereas the outer alignment problem is an alignment problem between the system and the humans outside of it (specifically between the base objective and the programmer’s intentions). In the context of machine learning, outer alignment refers to aligning the specified loss function with the intended goal, whereas inner alignment refers to aligning the mesa-objective of a mesa-optimizer with the specified loss function.
It might not be necessary to solve the inner alignment problem in order to produce safe, highly capable AI systems, as it might be possible to prevent mesa-optimizers from occurring in the first place. If mesa-optimizers cannot be reliably prevented, however, then some solution to both the outer and inner alignment problems will be necessary to ensure that mesa-optimizers are aligned with the intended goal of the programmers.
1.3. Robust alignment vs. pseudo-alignment
Given enough training, a mesa-optimizer should eventually be able to produce outputs that score highly on the base objective on the training distribution. Off the training distribution, however—and even on the training distribution while it is still early in the training process—the difference could be arbitrarily large. We will use the term robustly aligned to refer to mesa-optimizers with mesa-objectives that robustly agree with the base objective across distributions and the term pseudo-aligned to refer to mesa-optimizers with mesa-objectives that agree with the base objective on past training data, but not robustly across possible future data (either in testing, deployment, or further training). For a pseudo-aligned mesa-optimizer, there will be environments in which the base and mesa- objectives diverge. Pseudo-alignment, therefore, presents a potentially dangerous robustness problem since it opens up the possibility of a machine learning system that competently takes actions to achieve something other than the intended goal when off the training distribution. That is, its capabilities might generalize while its objective does not.
For a toy example of what pseudo-alignment might look like, consider an RL agent trained on a maze navigation task where all the doors during training happen to be red. Let the base objective (reward function) be $O_{\text{base}} = $ (1 if the agent reaches a door, 0 otherwise). On the training distribution, this objective is equivalent to $O_{\text{mesa}} = $ (1 if the agent reaches a red object, 0 otherwise). Consider what would happen if an agent, trained to high performance on $O_{\text{base}}$ on this task, were put in an environment where the doors are instead blue, and with some red objects that are not doors. It might generalize on $O_{\text{base}}$, reliably navigating to the blue door in each maze (robust alignment). But it might also generalize on $O_{\text{mesa}}$ instead of $O_{\text{base}}$, reliably navigating each maze to reach red objects (pseudo-alignment).[10]
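The divergence in this toy example can be made concrete with a short, hypothetical sketch (the gridworld details below are invented for illustration): a door-seeking policy and a red-seeking policy earn identical reward under $O_{\text{base}}$ on the training distribution, but come apart once doors are blue and some red objects are not doors.

```python
class Obj:
    def __init__(self, is_door, color):
        self.is_door, self.color = is_door, color

def base_reward(reached):                 # O_base: 1 if the agent reaches a door
    return 1 if reached.is_door else 0

def seek_doors(objects):                  # robustly aligned policy: go to a door
    return next(o for o in objects if o.is_door)

def seek_red(objects):                    # pseudo-aligned policy: go to a red object
    return next(o for o in objects if o.color == "red")

train_env  = [Obj(is_door=True, color="red"),  Obj(is_door=False, color="grey")]
deploy_env = [Obj(is_door=True, color="blue"), Obj(is_door=False, color="red")]

# On the training distribution the two policies are indistinguishable by reward...
assert base_reward(seek_doors(train_env)) == base_reward(seek_red(train_env)) == 1
# ...but off-distribution the red-seeking policy no longer reaches doors.
print(base_reward(seek_doors(deploy_env)), base_reward(seek_red(deploy_env)))  # 1 0
```

Both policies were equally good candidates for the base optimizer to select during training, which is exactly what makes pseudo-alignment hard to detect from training behavior alone.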
1.4. Mesa-optimization as a safety problem
If pseudo-aligned mesa-optimizers can arise in advanced ML systems, as we will suggest they might, they could pose two critical safety problems.
Unintended optimization. First, the possibility of mesa-optimization means that an advanced ML system could end up implementing a powerful optimization procedure even if its programmers never intended it to do so. This could be dangerous if such optimization leads the system to take extremal actions outside the scope of its intended behavior in trying to maximize its mesa-objective. Of particular concern are optimizers with objective functions and optimization procedures that generalize to the real world. The conditions that lead a learning algorithm to find mesa-optimizers, however, are very poorly understood. Knowing them would allow us to predict cases where mesa-optimization is more likely, as well as take measures to discourage mesa-optimization from occurring in the first place. The second post will examine some features of machine learning algorithms that might influence their likelihood of finding mesa-optimizers.
Inner alignment. Second, even in cases where it is acceptable for a base optimizer to find a mesa-optimizer, a mesa-optimizer might optimize for something other than the specified reward function. In such a case, it could produce bad behavior even if optimizing the correct reward function was known to be safe. This could happen either during training—before the mesa-optimizer gets to the point where it is aligned over the training distribution—or during testing or deployment when the system is off the training distribution. The third post will address some of the different ways in which a mesa-optimizer could be selected to optimize for something other than the specified reward function, as well as what attributes of an ML system are likely to encourage this. In the fourth post, we will discuss a possible extreme inner alignment failure—which we believe presents one of the most dangerous risks along these lines—wherein a sufficiently capable misaligned mesa-optimizer could learn to behave as if it were aligned without actually being robustly aligned. We will call this situation deceptive alignment.
It may be that pseudo-aligned mesa-optimizers are easy to address—if there exists a reliable method of aligning them, or of preventing base optimizers from finding them. However, it may also be that addressing misaligned mesa-optimizers is very difficult—the problem is not sufficiently well-understood at this point for us to know. Certainly, current ML systems do not produce dangerous mesa-optimizers, though whether future systems might is unknown. It is indeed because of these unknowns that we believe the problem is important to analyze.
The second post in the Risks from Learned Optimization Sequence, titled “Conditions for Mesa-Optimization,” can be found here.
1. As a concrete example of what a neural network optimizer might look like, consider TreeQN.(2) TreeQN, as described in Farquhar et al., is a Q-learning agent that performs model-based planning (via tree search in a latent representation of the environment states) as part of its computation of the Q-function. Though their agent is an optimizer by design, one could imagine a similar algorithm being learned by a DQN agent with a sufficiently expressive approximator for the Q function. Universal Planning Networks, as described by Srinivas et al.,(3) provide another example of a learned system that performs optimization, though the optimization there is built in, in the form of SGD via automatic differentiation. However, research such as that in Andrychowicz et al.(4) and Duan et al.(5) demonstrates that optimization algorithms can be learned by RNNs, making it possible that a Universal Planning Networks-like agent could be entirely learned—assuming a very expressive model space—including the internal optimization steps. Note that while these examples are taken from reinforcement learning, optimization might in principle take place in any sufficiently expressive learned system.
2. Previous work in this space has often centered around the concept of “optimization daemons,”(6) a framework that we believe is potentially misleading and hope to supplant. Notably, the term “optimization daemon” came out of discussions regarding the nature of humans and evolution, and, as a result, carries anthropomorphic connotations.
3. The duality comes from thinking of meta-optimization as one layer above the base optimizer and mesa-optimization as one layer below.
4. That being said, some of our considerations do still apply even in that case.
5. Leike et al.(8) introduce the concept of an objective recovered from perfect IRL.
6. For the formal construction of this objective, see pg. 6 in Leike et al.(8)
7. This objective is by definition trivially optimal in any situation that the bottle cap finds itself in.
8. Ultimately, our worry is optimization in the direction of some coherent but unsafe objective. In this sequence, we assume that search provides sufficient structure to expect coherent objectives. While we believe this is a reasonable assumption, it is unclear both whether search is necessary and whether it is sufficient. Further work examining this assumption will likely be needed.
9. The situation with evolution is more complicated than is presented here and we do not expect our analogy to live up to intense scrutiny. We present it as nothing more than that: an evocative analogy (and, to some extent, an existence proof) that explains the key concepts. More careful arguments are presented later.
10. Of course, it might also fail to generalize at all.
So, this was apparently in 2019. Given how central the ideas have become, it definitely belongs in the review.
I struggled a bit on deciding whether to nominate this sequence.
On the one hand, it brought a lot more prominence to the inner alignment problem by making an argument for it in a lot more detail than had been done before.
On the other hand, on my beliefs, the framework it presents has an overly narrow view of what counts as inner alignment, relies on a model of AI development that I do not think is accurate, causes people to say “but what about mesa optimization” in response to any advance that doesn’t involve mesa optimization even if the advance is useful for other reasons, has led to significant confusion over what exactly does and does not count as mesa optimization, and tends to cause people to take worse steps in choosing future research topics. (I expect all of these claims will be controversial.)
Still, that the conversation is happening at all is a vast improvement over the previous situation of relative (public) silence on the problem. Saying a bunch of confused thoughts is often the precursor to an actual good understanding of a topic. As such I’ve decided to nominate it for that contribution.
I think I can guess what your disagreements are regarding too narrow a conception of inner alignment/mesa-optimization (that the paper overly focuses on models mechanistically implementing optimization), though I’m not sure what model of AI development it relies on that you don’t think is accurate, and would be curious for details there. I’d also be interested in what sorts of worse research topics you think it has tended to encourage (on my view, I think this paper should make you more excited about directions like transparency and robustness and less excited about directions involving careful incentive/environment design). Also, for the paper giving people a “but what about mesa-optimization” response, I’m imagining you’re referring to things like this post, though I’d appreciate some clarification there as well.
As a preamble, I should note that I’m putting on my “critical reviewer” hat here. I’m not intentionally being negative—I am reporting my inside-view beliefs on each question—but as a general rule, I expect these to be biased negatively; someone looking at research from the outside doesn’t have the same intuitions for its utility and so will usually inside-view underestimate its value.
This is also all things I’m saying with the benefit of hindsight, idk what I would have said at the time the sequence was published. I’m not trying to be “fair” to the sequence here, that is, I’m not considering what it would have been reasonable to believe at the time.
Yup, that’s right.
There seems to be an implicit model that when you do machine learning you get out a complicated mess of a neural net that is hard to interpret, but at its core it still is learning something akin to a program, and hence concepts like “explicit (mechanistic) search algorithm” are reasonable to expect. (Or at least, that this will be true for sufficiently intelligent AI systems.)
I don’t think this model (implicit claim?) is correct. (For comparison, I also don’t think this model would be correct if applied to human cognition.)
A couple of examples:
Attempting to create an example of a learned mechanistic search algorithm (I know of at least one proposal that was trying to do this)
Of your concrete experiments, I don’t expect to learn anything of interest from the first two (they aren’t the sort of thing that would generalize from small environments to large environments); I like the third; the fourth and fifth seem like interesting AI research but I don’t think they’d shed light on mesa-optimization / inner alignment or its solutions.
I agree with this. Maybe people have gotten more interested in transparency as a result of this paper? That seems plausible.
Actually, not that one. This is more like “why are you working on reward learning—even if you solved it we’d still be worried about mesa optimization”. Possibly no one believes this, but I often feel like this implication is present. I don’t have any concrete examples at the moment; it’s possible that I’m imagining it where it doesn’t exist, or that this is only a fact about how I interpret other people rather than what they actually believe.
I know it’s already been nominated twice, but I still want to nominate it again. This sequence (I’m nominating the sequence) helped me think clearly about optimization, and how delegation works between an optimizer and mesa-optimizer, and what constraints lie between them (e.g. when does an optimizer want a system it’s developing to do optimization?). Changed a lot of the basic ways in which I think about optimization and AI.