Clarifying inner alignment terminology
I have seen a lot of confusion recently about exactly how outer and inner alignment should be defined, and I want to offer my attempt at a clarification.
Here’s my diagram of how I think the various concepts should fit together:
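The idea of this diagram is that the arrows are implications: for any problem in the diagram, if its direct subproblems are solved, then it should be solved as well (though not necessarily vice versa). Thus, we get:
inner alignment ⇒ objective robustness
outer alignment ∧ objective robustness ⇒ intent alignment
intent alignment ∧ capability robustness ⇒ impact alignment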
And here are all my definitions of the relevant terms which I think produce those implications:
(Impact) Alignment: An agent is impact aligned (with humans) if it doesn’t take actions that we would judge to be bad/problematic/dangerous/catastrophic.
Intent Alignment: An agent is intent aligned if the optimal policy for its behavioral objective[1] is impact aligned with humans.
Outer Alignment: An objective function is outer aligned if all models that perform optimally on it in the limit of perfect training and infinite data are intent aligned.[2]
Robustness: An agent is robust if it performs well on the base objective it was trained under even in deployment/off-distribution.[3]
Objective Robustness: An agent is objective robust if the optimal policy for its behavioral objective is impact aligned with the base objective it was trained under.
Capability Robustness: An agent is capability robust if it performs well on its behavioral objective even in deployment/off-distribution.
Inner Alignment: A mesa-optimizer is inner aligned if the optimal policy for its mesa-objective is impact aligned with the base objective it was trained under.
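To make a few of these definitions concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration and is not part of the original definitions: the two states, the `BAD_ACTIONS` sets, the representation of a behavioral objective as a per-state score table (stipulated here rather than recovered from behavior), and the reading of "performs well" as "picks an action that is optimal under the behavioral objective".

```python
from typing import Dict, Set

State = str
Action = str
Objective = Dict[State, Dict[Action, float]]  # hypothetical: score of each action per state
Policy = Dict[State, Action]                  # hypothetical: the action actually taken per state

# Assumed toy world: one on-distribution state and one off-distribution state.
STATES = ["train", "deploy"]
# Assumed stand-in for "actions that we would judge to be bad".
BAD_ACTIONS: Dict[State, Set[Action]] = {"train": {"harm"}, "deploy": {"harm"}}


def optimal_actions(objective: Objective, state: State) -> Set[Action]:
    """All actions that maximize the objective in the given state."""
    scores = objective[state]
    best = max(scores.values())
    return {action for action, score in scores.items() if score == best}


def impact_aligned(policy: Policy) -> bool:
    """The agent never actually takes a bad action."""
    return all(policy[s] not in BAD_ACTIONS[s] for s in STATES)


def intent_aligned(behavioral_objective: Objective) -> bool:
    """Every optimal action under the behavioral objective is non-bad."""
    return all(
        optimal_actions(behavioral_objective, s).isdisjoint(BAD_ACTIONS[s])
        for s in STATES
    )


def capability_robust(policy: Policy, behavioral_objective: Objective) -> bool:
    """The agent acts optimally for its behavioral objective in every state."""
    return all(policy[s] in optimal_actions(behavioral_objective, s) for s in STATES)


scores = {"help": 1.0, "idle": 0.0, "harm": -1.0}
helpful_objective: Objective = {"train": dict(scores), "deploy": dict(scores)}

competent: Policy = {"train": "help", "deploy": "help"}
clumsy: Policy = {"train": "help", "deploy": "harm"}  # makes a mistake off-distribution
```

In this toy, both agents share an intent-aligned behavioral objective, but only `competent` is capability robust, and only `competent` ends up impact aligned. The `clumsy` agent is meant to illustrate an aligned objective paired with an off-distribution mistake.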
And an explanation of each of the diagram’s implications:
Inner alignment ⇒ objective robustness: If a model is a mesa-optimizer, then its behavioral objective should match its mesa-objective, which means that if its mesa-objective is aligned with the base objective, then its behavioral objective should be too.
Outer alignment ∧ objective robustness ⇒ intent alignment: Outer alignment ensures that the base objective is measuring what we actually care about, and objective robustness ensures that the model’s behavioral objective is aligned with that base objective. Thus, putting them together, we get that the model’s behavioral objective must be aligned with humans, which is precisely intent alignment.
Intent alignment ∧ capability robustness ⇒ impact alignment: Intent alignment ensures that the behavioral objective is aligned with humans, and capability robustness ensures that the model actually pursues that behavioral objective effectively (even off-distribution), which means that the model will actually always take aligned actions, not just have an aligned behavioral objective.
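The three implications can be read as inference rules, and what follows from a given set of solved problems can be computed mechanically. The sketch below is only an illustration: the `RULES` table and the `closure` helper are hypothetical names, and the encoding treats each property as a plain label rather than anything with internal structure.

```python
from typing import FrozenSet, List, Set, Tuple

# Each rule is (premises, conclusion), transcribed from the three implications.
RULES: List[Tuple[FrozenSet[str], str]] = [
    (frozenset({"inner alignment"}), "objective robustness"),
    (frozenset({"outer alignment", "objective robustness"}), "intent alignment"),
    (frozenset({"intent alignment", "capability robustness"}), "impact alignment"),
]


def closure(solved: Set[str]) -> Set[str]:
    """Forward-chain over RULES until no new property can be derived."""
    derived = set(solved)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived


outer_and_inner = closure({"outer alignment", "inner alignment"})
with_capability = closure(
    {"outer alignment", "inner alignment", "capability robustness"}
)
```

Here `outer_and_inner` contains intent alignment but not impact alignment, while `with_capability` also contains impact alignment.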
FAQ
If a model is both outer and inner aligned, what does that imply?
Intent alignment. Reading off the implications from the diagram, we can see that the conjunction of outer and inner alignment gets us to intent alignment, but not all the way to impact alignment, as we’re missing capability robustness.
Can impact alignment be split into outer alignment and inner alignment?
No. As I just mentioned, the conjunction of both outer and inner alignment only gives us intent alignment, not impact alignment. Furthermore, if the model is not a mesa-optimizer, then it can be objective robust (and thus intent aligned) without being inner aligned.
Does a model have to be inner aligned to be impact aligned?
No—we only need inner alignment if we’re dealing with mesa-optimization. While we can get impact alignment through a combination of inner alignment, outer alignment, and capability robustness, the diagram tells us that we can get the same exact thing if we substitute objective robustness for inner alignment—and while inner alignment implies objective robustness, the converse is not true.
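One way to check claims like this against the diagram is to enumerate every truth assignment consistent with its three implications and ask whether a claim holds in all of them. This is a small sketch under the assumption that the six properties can be treated as independent booleans; the names `PROPS`, `consistent`, and `entailed` are illustrative only.

```python
from itertools import product
from typing import Callable, Dict, Iterator

PROPS = ["inner", "outer", "objective_robust", "capability_robust", "intent", "impact"]
World = Dict[str, bool]


def implies(p: bool, q: bool) -> bool:
    return (not p) or q


def consistent(w: World) -> bool:
    """True if the assignment respects all three implications in the diagram."""
    return (
        implies(w["inner"], w["objective_robust"])
        and implies(w["outer"] and w["objective_robust"], w["intent"])
        and implies(w["intent"] and w["capability_robust"], w["impact"])
    )


def worlds() -> Iterator[World]:
    """Yield every truth assignment that is consistent with the diagram."""
    for values in product([False, True], repeat=len(PROPS)):
        w = dict(zip(PROPS, values))
        if consistent(w):
            yield w


def entailed(claim: Callable[[World], bool]) -> bool:
    """A claim follows from the diagram if it holds in every consistent world."""
    return all(claim(w) for w in worlds())


# Impact alignment does not require inner alignment.
impact_needs_inner = entailed(lambda w: implies(w["impact"], w["inner"]))
# Objective robustness does not imply inner alignment (the converse fails).
robust_gives_inner = entailed(lambda w: implies(w["objective_robust"], w["inner"]))
# Substituting objective robustness for inner alignment still yields impact alignment.
substitution_works = entailed(
    lambda w: implies(
        w["outer"] and w["objective_robust"] and w["capability_robust"], w["impact"]
    )
)
```

Under this encoding, `impact_needs_inner` and `robust_gives_inner` are both false, while `substitution_works` is true. Note that this only shows what the diagram's implications do and do not entail; it says nothing about whether the implications themselves hold for real systems.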
How does this breakdown distinguish between the general concept of inner alignment as failing “when your capabilities generalize but your objective does not” and the more specific concept of inner alignment as “eliminating the base-mesa objective gap?”[4]
Only the more specific definition is inner alignment. Under this set of terminology, the more general definition instead refers to objective robustness, of which inner alignment is only a subproblem.
What type of problem is deceptive alignment?[5]
Inner alignment—assuming that deception requires mesa-optimization. If we relax that assumption, then it becomes an objective robustness problem. Since deception is a problem with the model trying to do the wrong thing, it’s clearly an intent alignment problem rather than a capability robustness problem—and see here for an explanation of why deception is never an outer alignment problem. Thus, it has to be an objective robustness problem—and if we’re dealing with a mesa-optimizer, an inner alignment problem.
What type of problem is training a model to maximize paperclips?
Outer alignment—maximizing paperclips isn’t an aligned objective even in the limit of infinite data.
How does this picture relate to a more robustness-centric version?
The above diagram can easily be reorganized into an equivalent, more robustness-centric version, which I’ve included below. This diagram is intended to be fully compatible with the above diagram—using the exact same definitions of all the terms as given above—but with robustness given a more central role, replacing the central role of intent alignment in the above diagram.
Edit: Previously I had this diagram only in a footnote, but I decided it was useful enough to promote it to the main body.
[1] The point of talking about the “optimal policy for a behavioral objective” is to reference what an agent’s behavior would look like if it never made any “mistakes.” Primarily, I mean this just in that intuitive sense, but we can also try to build a somewhat more rigorous picture if we imagine using perfect IRL in the limit of infinite data to recover a behavioral objective and then look at the optimal policy under that objective.
[2] What I mean by perfect training and infinite data here is for the model to always have optimal loss on all data points that it ever encounters. That gets a bit tricky for reinforcement learning, though in that setting we can ask for the model to act according to the optimal policy on the actual MDP that it experiences.
[3] Note that robustness as a whole isn’t included in the diagram as I thought it made it too messy. For an implication diagram with robustness instead of intent alignment, see the alternative diagram in the FAQ.
[4] See here for an example of this confusion regarding the more general vs. more specific uses of inner alignment.
[5] See here for an example of this confusion regarding deceptive alignment.
This post aims to clarify the definitions of a number of concepts in AI alignment introduced by the author and collaborators. The concepts are interesting, and some researchers evidently find them useful. Personally, I find the definitions confusing, but I did benefit a little from thinking about this confusion. In my opinion, the post could greatly benefit from introducing mathematical notation[1] and making the concepts precise at least in some very simplistic toy model.
In the following, I’ll try going over some of the definitions and explicating my understanding/confusion regarding each. The definitions I omitted either explicitly refer to these or have analogous structure.
This one is more or less clear. Even though it’s not a formal definition, it doesn’t have to be: after all, this is precisely the problem we are trying to solve.
The “behavioral objective” is defined in a linked page as:
This is already thorny territory, since it’s far from clear what “perfect inverse reinforcement learning” is. Intuitively, an “intent aligned” agent is supposed to be one whose behavior demonstrates an aligned objective, but it can still make mistakes with catastrophic consequences. The example I imagine is: an AI researcher who is unwittingly building transformative unaligned AI.
This is confusing because it’s unclear what counts as “well” and what the underlying assumptions are. The no-free-lunch theorems imply that an agent cannot perform too well off-distribution, unless you’re still constraining the distribution somehow. I’m guessing that either this agent is doing online learning or it’s detecting off-distribution and failing gracefully in some sense, or maybe some combination of both.
Notably, the post asserts the implication intent alignment + capability robustness ⇒ impact alignment. Now, let’s go back to the example of the misguided AI researcher. In what sense are they not “capability robust”? I don’t know.
The “mesa-objective” is defined in the linked page as:
So it seems like we could replace “mesa-objective” with just “objective”. This is confusing, because in other places the author felt the need to use “behavioral objective”, but here he is referring to some other notion of objective, and it’s not clear what the difference is.
I guess that different people have different difficulties. I often hear that my own articles are difficult to understand because of the dense mathematics. But for me, it is the absence of mathematics which is difficult!