Clarifying inner alignment terminology
I have seen a lot of confusion recently about exactly how outer and inner alignment should be defined, so I want to offer my attempt at a clarification.
Here’s my diagram of how I think the various concepts should fit together:
The idea of this diagram is that the arrows are implications—that is, for any problem in the diagram, if its direct subproblems are solved, then it should be solved as well (though not necessarily vice versa). Thus, we get:
- inner alignment ⇒ objective robustness
- outer alignment + objective robustness ⇒ intent alignment
- intent alignment + capability robustness ⇒ impact alignment
And here are all my definitions of the relevant terms which I think produce those implications:
(Impact) Alignment: An agent is impact aligned (with humans) if it doesn’t take actions that we would judge to be bad/problematic/dangerous/catastrophic.
Intent Alignment: An agent is intent aligned if the optimal policy for its behavioral objective[1] is impact aligned with humans.
Outer Alignment: An objective function is outer aligned if all models that perform optimally on it in the limit of perfect training and infinite data are intent aligned.[2]
Robustness: An agent is robust if it performs well on the base objective it was trained under even in deployment/off-distribution.[3]
Objective Robustness: An agent is objective robust if the optimal policy for its behavioral objective is impact aligned with the base objective it was trained under.
Capability Robustness: An agent is capability robust if it performs well on its behavioral objective even in deployment/off-distribution.
Inner Alignment: A mesa-optimizer is inner aligned if the optimal policy for its mesa-objective is impact aligned with the base objective it was trained under.
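To make the parallel structure of these definitions a little more visible, here is one very loose way to write them in toy notation. The symbols below (L_base, L_behav, L_mesa, Aligned) are my own shorthand rather than anything from the post, and they sweep the hard part—what "aligned with" means—under the rug:

```latex
% Toy shorthand (mine, not the post's):
%   \pi        : a policy
%   L_base     : the base objective the model is trained on
%   L_behav    : the behavioral objective (idealized IRL recovery)
%   L_mesa     : the mesa-objective, if the model is a mesa-optimizer
%   Aligned(.) : "is impact aligned with humans"
\begin{align*}
\text{Intent alignment:}     &\quad \mathrm{Aligned}\Big(\arg\max_{\pi} L_{\mathrm{behav}}(\pi)\Big)\\
\text{Outer alignment:}      &\quad \text{every } \pi \in \arg\max_{\pi'} L_{\mathrm{base}}(\pi')
                               \text{ (perfect training, infinite data) is intent aligned}\\
\text{Objective robustness:} &\quad \arg\max_{\pi} L_{\mathrm{behav}}(\pi) \text{ is aligned with } L_{\mathrm{base}}\\
\text{Inner alignment:}      &\quad \arg\max_{\pi} L_{\mathrm{mesa}}(\pi) \text{ is aligned with } L_{\mathrm{base}}
\end{align*}
```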
And an explanation of each of the diagram’s implications:
Inner alignment ⇒ objective robustness: If a model is a mesa-optimizer, then its behavioral objective should match its mesa-objective, which means that if its mesa-objective is aligned with the base objective, then its behavioral objective should be too.
Outer alignment + objective robustness ⇒ intent alignment: Outer alignment ensures that the base objective is measuring what we actually care about, and objective robustness ensures that the model's behavioral objective is aligned with that base objective. Thus, putting them together, we get that the model's behavioral objective must be aligned with humans, which is precisely intent alignment.
Intent alignment + capability robustness ⇒ impact alignment: Intent alignment ensures that the behavioral objective is aligned with humans, and capability robustness ensures that the model actually pursues that behavioral objective effectively—even off-distribution—which means that the model will actually always take aligned actions, not just have an aligned behavioral objective.
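Chaining those implications together (still informally, in the same loose notation as the sketch above) gives the full route the diagram traces from the subproblems down to impact alignment:

```latex
\begin{align*}
\text{inner alignment} &\Rightarrow \text{objective robustness} \quad (\text{for mesa-optimizers})\\
\text{outer alignment} \wedge \text{objective robustness} &\Rightarrow \text{intent alignment}\\
\text{intent alignment} \wedge \text{capability robustness} &\Rightarrow \text{impact alignment}\\
\text{hence:}\quad \text{outer alignment} \wedge \text{objective robustness} \wedge \text{capability robustness} &\Rightarrow \text{impact alignment}
\end{align*}
```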
FAQ
If a model is both outer and inner aligned, what does that imply?
Intent alignment. Reading off the implications from the diagram, we can see that the conjunction of outer and inner alignment gets us to intent alignment, but not all the way to impact alignment, as we’re missing capability robustness.
Can impact alignment be split into outer alignment and inner alignment?
No. As I just mentioned, the conjunction of both outer and inner alignment only gives us intent alignment, not impact alignment. Furthermore, if the model is not a mesa-optimizer, then it can be objective robust (and thus intent aligned) without being inner aligned.
Does a model have to be inner aligned to be impact aligned?
No—we only need inner alignment if we’re dealing with mesa-optimization. While we can get impact alignment through a combination of inner alignment, outer alignment, and capability robustness, the diagram tells us that we can get the same exact thing if we substitute objective robustness for inner alignment—and while inner alignment implies objective robustness, the converse is not true.
How does this breakdown distinguish between the more general concept of inner alignment as failing “when your capabilities generalize but your objective does not” and the more specific concept of inner alignment as “eliminating the base-mesa objective gap”?[4]
Only the more specific definition is inner alignment. Under this set of terminology, the more general definition instead refers to objective robustness, of which inner alignment is only a subproblem.
What type of problem is deceptive alignment?[5]
Inner alignment—assuming that deception requires mesa-optimization. If we relax that assumption, then it becomes an objective robustness problem. Since deception is a problem with the model trying to do the wrong thing, it’s clearly an intent alignment problem rather than a capability robustness problem—and see here for an explanation of why deception is never an outer alignment problem. Thus, it has to be an objective robustness problem—and if we’re dealing with a mesa-optimizer, an inner alignment problem.
What type of problem is training a model to maximize paperclips?
Outer alignment—maximizing paperclips isn’t an aligned objective even in the limit of infinite data.
How does this picture relate to a more robustness-centric version?
The above diagram can easily be reorganized into an equivalent, more robustness-centric version, which I’ve included below. This version is intended to be fully compatible with the first diagram—using the exact same definitions of all the terms as given above—but with robustness playing the central role that intent alignment plays in the first diagram.
Edit: Previously I had this diagram only in a footnote, but I decided it was useful enough to promote it to the main body.
1. The point of talking about the “optimal policy for a behavioral objective” is to reference what an agent’s behavior would look like if it never made any “mistakes.” Primarily, I mean this just in that intuitive sense, but we can also try to build a somewhat more rigorous picture if we imagine using perfect IRL in the limit of infinite data to recover a behavioral objective and then look at the optimal policy under that objective.
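As a very rough illustration of that more rigorous picture, here is a toy caricature of my own (the environment, candidate objectives, and numbers below are all made up, not anything from the post): in a tiny finite world with a finite set of candidate objectives, "perfect IRL in the limit" can be caricatured as picking whichever candidate objective best explains the observed behavior, and the "optimal policy for the behavioral objective" is then just the greedy policy under that candidate.

```python
# Toy caricature of "recover a behavioral objective via (perfect) IRL, then
# look at its optimal policy". All names and numbers are made up for
# illustration; this is not an implementation of anything from the post.

# A tiny world: 3 states, 2 actions. reward[objective][state][action].
CANDIDATE_OBJECTIVES = {
    "collect_coins": [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    "reach_goal":    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
}

# Observed behavior: the action the agent actually took in each state.
observed_actions = [0, 1, 0]

def optimal_policy(reward):
    """Greedy policy: in each state, take the highest-reward action."""
    return [max(range(len(r)), key=lambda a: r[a]) for r in reward]

def behavioral_objective(observed, candidates):
    """'Perfect IRL in the limit' caricature: pick the candidate objective
    whose optimal policy agrees with the observed behavior most often."""
    def agreement(reward):
        pi = optimal_policy(reward)
        return sum(p == o for p, o in zip(pi, observed))
    return max(candidates, key=lambda name: agreement(candidates[name]))

behav = behavioral_objective(observed_actions, CANDIDATE_OBJECTIVES)
pi_star = optimal_policy(CANDIDATE_OBJECTIVES[behav])

print("inferred behavioral objective:", behav)
print("optimal policy for it:        ", pi_star)

# Any state where pi_star disagrees with observed_actions is a "mistake"
# relative to the agent's own (inferred) behavioral objective.
mistakes = [s for s, (p, o) in enumerate(zip(pi_star, observed_actions)) if p != o]
print("states where behavior looks like a mistake:", mistakes)
```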
2. What I mean by perfect training and infinite data here is for the model to always have optimal loss on all data points that it ever encounters. That gets a bit tricky for reinforcement learning, though in that setting we can ask for the model to act according to the optimal policy on the actual MDP that it experiences.
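To make the RL version of this footnote concrete, here is a minimal sketch (my own toy example, with made-up transitions and rewards) of what "acting according to the optimal policy on the actual MDP it experiences" means: given the MDP's true dynamics and rewards, value iteration yields the optimal policy, and "perfect training" in this sense would mean the model's behavior matches that policy on every state it actually encounters.

```python
# Minimal value-iteration sketch on a made-up 3-state, 2-action MDP.
# "Perfect training" in the footnote's sense would mean the trained model's
# policy matches `optimal_policy` on every state it actually encounters.

N_STATES, N_ACTIONS, GAMMA = 3, 2, 0.9

# transition[s][a] = next state; reward[s][a] = immediate reward (made up).
transition = [[1, 2], [2, 0], [2, 1]]
reward     = [[0.0, 1.0], [0.0, 0.0], [5.0, 0.0]]

def value_iteration(n_iters=200):
    """Standard synchronous value iteration on the toy MDP."""
    V = [0.0] * N_STATES
    for _ in range(n_iters):
        V = [max(reward[s][a] + GAMMA * V[transition[s][a]]
                 for a in range(N_ACTIONS))
             for s in range(N_STATES)]
    return V

def optimal_policy(V):
    """Act greedily with respect to the converged value function."""
    return [max(range(N_ACTIONS),
                key=lambda a: reward[s][a] + GAMMA * V[transition[s][a]])
            for s in range(N_STATES)]

V_star = value_iteration()
print("optimal values:", [round(v, 2) for v in V_star])
print("optimal policy:", optimal_policy(V_star))
```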
3. Note that robustness as a whole isn’t included in the diagram as I thought it made it too messy. For an implication diagram with robustness instead of intent alignment, see the alternative diagram in the FAQ.
4. See here for an example of this confusion regarding the more general vs. more specific uses of inner alignment.
5. See here for an example of this confusion regarding deceptive alignment.
This post aims to clarify the definitions of a number of concepts in AI alignment introduced by the author and collaborators. The concepts are interesting, and some researchers evidently find them useful. Personally, I find the definitions confusing, but I did benefit a little from thinking about this confusion. In my opinion, the post could greatly benefit from introducing mathematical notation[1] and making the concepts precise at least in some very simplistic toy model.
In the following, I’ll try going over some of the definitions and explicating my understanding/confusion regarding each. The definitions I omitted either explicitly refer to these or have analogous structure.
The first definition, of (impact) alignment, is more or less clear. Even though it’s not a formal definition, it doesn’t have to be: after all, this is precisely the problem we are trying to solve.
The “behavioral objective” is defined in a linked page as:
This is already thorny territory, since it’s far from clear what “perfect inverse reinforcement learning” is. Intuitively, an “intent aligned” agent is supposed to be one whose behavior demonstrates an aligned objective, but which can still make mistakes with catastrophic consequences. The example I imagine is an AI researcher who is unwittingly building transformative unaligned AI.
This definition is confusing because it’s unclear what counts as “well” and what the underlying assumptions are. The no-free-lunch theorems imply that an agent cannot perform too well off-distribution unless you’re still constraining the distribution somehow. I’m guessing that either this agent is doing online learning, or it’s detecting when it’s off-distribution and failing gracefully in some sense, or maybe some combination of both.
Notably, the post asserts the implication intent alignment + capability robustness ⇒ impact alignment. Now, let’s go back to the example of the misguided AI researcher. In what sense are they not “capability robust”? I don’t know.
The “mesa-objective” is defined in the linked page as:
So it seems like we could replace “mesa-objective” with just “objective”. This is confusing, because elsewhere the author felt the need to say “behavioral objective”, but here he is referring to some other notion of objective, and it’s not clear what the difference is.
I guess that different people have different difficulties. I often hear that my own articles are difficult to understand because of the dense mathematics. But for me, it is the absence of mathematics which is difficult!