
Inner Alignment

Last edit: 9 Oct 2023 23:35 UTC by Linda Linsefors

Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?

More specifically, inner alignment is the problem of ensuring that mesa-optimizers (i.e., trained ML systems that are themselves optimizers) are aligned with the objective function of the training process.

As an example, evolution is an optimization force that itself ‘designed’ optimizers (humans) to achieve its goals. However, humans do not primarily maximize reproductive success; instead, they use birth control while still attaining the pleasure that evolution meant as a reward for attempts at reproduction. This is a failure of inner alignment.

The term was first given a definition in the Hubinger et al. paper Risks from Learned Optimization:

We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.

Goal misgeneralization due to distribution shift is another example of an inner alignment failure: the mesa-optimizer appears to pursue the base objective during training but stops pursuing it during deployment. Good performance on the training distribution can mislead us into thinking the mesa-optimizer is pursuing the base objective, when in fact correlations in the training distribution made the base and mesa objectives agree. A distribution shift from training to deployment breaks that correlation, and the mesa-objective fails to generalize. This is especially problematic when the system's capabilities generalize to the deployment distribution while its objectives/goals do not, since we are then left with a capable system optimizing for a misaligned goal.
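
As a purely illustrative sketch (the corridor environment and `rollout` function below are hypothetical, loosely modeled on the coin-at-the-end examples from the goal misgeneralization literature, not taken from any of the posts listed here), a proxy goal of "walk to the end of the corridor" coincides with the base objective of "stop on the coin" during training, where the coin is always at the end, and then fails once the coin's position is randomized at deployment:

```python
# Purely illustrative sketch: a "trained" policy whose proxy goal
# ("walk to the last cell and stop there") coincides with the base
# objective ("stop on the coin's cell") only on the training distribution.
import random

CORRIDOR_LENGTH = 10  # cells 0..9


def rollout(coin_position: int) -> bool:
    """Run the learned policy and report whether the base objective was met.

    The policy's capability (navigation) works everywhere, but its goal is
    the proxy "go to the last cell", so it only ends up on the coin when the
    coin happens to be at the end of the corridor.
    """
    final_position = CORRIDOR_LENGTH - 1  # the policy always walks to the end
    return final_position == coin_position


# Training distribution: the coin is always at the end, so the proxy and base
# objectives are perfectly correlated and the policy looks aligned.
train = [rollout(coin_position=CORRIDOR_LENGTH - 1) for _ in range(1000)]
print("training success rate:", sum(train) / len(train))      # 1.0

# Deployment distribution: the coin is placed uniformly at random. The
# capability (walking the corridor) generalizes, but the goal does not.
deploy = [rollout(coin_position=random.randrange(CORRIDOR_LENGTH)) for _ in range(1000)]
print("deployment success rate:", sum(deploy) / len(deploy))  # roughly 0.1
```

A real failure would involve a learned policy rather than a hard-coded one, but the structure is the same: the training signal alone cannot distinguish the proxy goal from the intended one, so nothing pushes the learned objective toward the intended generalization.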

Solving the inner alignment problem would require progress on several sub-problems, including deceptive alignment, distribution shift, and gradient hacking.

Inner Alignment vs. Outer Alignment

Inner alignment is often treated as separate from outer alignment: the former is about guaranteeing that we are robustly aiming at something at all, while the latter is about what exactly we should be aiming at. For more information, see the corresponding tag.

Keep in mind that inner and outer alignment failures can occur together. The two are not a strict dichotomy, and even experienced alignment researchers often cannot tell them apart, which suggests that classifying failures under these terms is fuzzy. Ideally, rather than treating inner and outer alignment as a binary split to be tackled separately, we should take a more holistic view of alignment that accounts for the interplay between inner and outer alignment approaches.

Related Pages:

Mesa-Optimization, Treacherous Turn, Eliciting Latent Knowledge, Deceptive Alignment, Deception


The Inner Alignment Problem

4 Jun 2019 1:20 UTC
103 points
17 comments13 min readLW link

Risks from Learned Optimization: Introduction

31 May 2019 23:44 UTC
185 points
42 comments12 min readLW link3 reviews

Inner Alignment: Explain like I’m 12 Edition

Rafael Harth1 Aug 2020 15:24 UTC
181 points
47 comments13 min readLW link2 reviews

Demons in Imperfect Search

johnswentworth11 Feb 2020 20:25 UTC
107 points
21 comments3 min readLW link

Mesa-Search vs Mesa-Control

abramdemski18 Aug 2020 18:51 UTC
55 points
45 comments7 min readLW link

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley6 Jul 2024 1:23 UTC
58 points
39 comments24 min readLW link

How To Go From Interpretability To Alignment: Just Retarget The Search

johnswentworth10 Aug 2022 16:08 UTC
204 points
34 comments3 min readLW link1 review

Reward is not the optimization target

TurnTrout25 Jul 2022 0:03 UTC
376 points
123 comments10 min readLW link3 reviews

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley28 Nov 2023 19:56 UTC
64 points
30 comments11 min readLW link

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley5 Jan 2024 8:46 UTC
37 points
4 comments2 min readLW link

Why almost every RL agent does learned optimization

Lee Sharkey12 Feb 2023 4:58 UTC
32 points
3 comments5 min readLW link

Open question: are minimal circuits daemon-free?

paulfchristiano5 May 2018 22:40 UTC
83 points
70 comments2 min readLW link1 review

Matt Botvinick on the spontaneous emergence of learning algorithms

Adam Scholl12 Aug 2020 7:47 UTC
154 points
87 comments5 min readLW link

Searching for Search

28 Nov 2022 15:31 UTC
94 points
9 comments14 min readLW link1 review

Concrete experiments in inner alignment

evhub6 Sep 2019 22:16 UTC
74 points
12 comments6 min readLW link

Relaxed adversarial training for inner alignment

evhub10 Sep 2019 23:03 UTC
69 points
27 comments27 min readLW link

minutes from a human-alignment meeting

bhauth24 May 2024 5:01 UTC
66 points
4 comments2 min readLW link

Malign generalization without internal search

Matthew Barnett12 Jan 2020 18:03 UTC
43 points
12 comments4 min readLW link

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout2 Dec 2022 2:43 UTC
147 points
22 comments47 min readLW link3 reviews

Theoretical Neuroscience For Alignment Theory

Cameron Berg7 Dec 2021 21:50 UTC
65 points
18 comments23 min readLW link

Tessellating Hills: a toy model for demons in imperfect search

DaemonicSigil20 Feb 2020 0:12 UTC
97 points
18 comments2 min readLW link

Outer vs inner misalignment: three framings

Richard_Ngo6 Jul 2022 19:46 UTC
51 points
5 comments9 min readLW link

Empirical Observations of Objective Robustness Failures

23 Jun 2021 23:23 UTC
63 points
5 comments9 min readLW link

Discussion: Objective Robustness and Inner Alignment Terminology

23 Jun 2021 23:25 UTC
73 points
7 comments9 min readLW link

Question 2: Predicted bad outcomes of AGI learning architecture

Cameron Berg11 Feb 2022 22:23 UTC
5 points
1 comment10 min readLW link

Gradient hacking

evhub16 Oct 2019 0:53 UTC
106 points
39 comments3 min readLW link2 reviews

Are minimal circuits deceptive?

evhub7 Sep 2019 18:11 UTC
78 points
11 comments8 min readLW link

Book review: “A Thousand Brains” by Jeff Hawkins

Steven Byrnes4 Mar 2021 5:10 UTC
116 points
18 comments19 min readLW link

Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

Palus Astra1 Jul 2020 17:30 UTC
35 points
4 comments67 min readLW link

Inner alignment requires making assumptions about human values

Matthew Barnett20 Jan 2020 18:38 UTC
26 points
9 comments4 min readLW link

Defining capability and alignment in gradient descent

Edouard Harris5 Nov 2020 14:36 UTC
22 points
6 comments10 min readLW link

Does SGD Produce Deceptive Alignment?

Mark Xu6 Nov 2020 23:48 UTC
96 points
9 comments16 min readLW link

Steering subsystems: capabilities, agency, and alignment

Seth Herd29 Sep 2023 13:45 UTC
26 points
0 comments8 min readLW link

How likely is deceptive alignment?

evhub30 Aug 2022 19:34 UTC
103 points
28 comments60 min readLW link

An overview of 11 proposals for building safe advanced AI

evhub29 May 2020 20:38 UTC
213 points
36 comments38 min readLW link2 reviews

[Question] Does iterated amplification tackle the inner alignment problem?

JanB15 Feb 2020 12:58 UTC
7 points
4 comments1 min readLW link

Inner Alignment in Salt-Starved Rats

Steven Byrnes19 Nov 2020 2:40 UTC
137 points
41 comments11 min readLW link2 reviews

The (partial) fallacy of dumb superintelligence

Seth Herd18 Oct 2023 21:25 UTC
38 points
5 comments4 min readLW link

Mesa-Optimizers vs “Steered Optimizers”

Steven Byrnes10 Jul 2020 16:49 UTC
45 points
7 comments8 min readLW link

AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger

DanielFilan18 Feb 2021 0:03 UTC
43 points
10 comments87 min readLW link

Against evolution as an analogy for how humans will create AGI

Steven Byrnes23 Mar 2021 12:29 UTC
65 points
25 comments25 min readLW link

My AGI Threat Model: Misaligned Model-Based RL Agent

Steven Byrnes25 Mar 2021 13:45 UTC
74 points
40 comments16 min readLW link

Gradations of Inner Alignment Obstacles

abramdemski20 Apr 2021 22:18 UTC
81 points
22 comments9 min readLW link

Pre-Training + Fine-Tuning Favors Deception

Mark Xu8 May 2021 18:36 UTC
27 points
3 comments3 min readLW link

Formal Inner Alignment, Prospectus

abramdemski12 May 2021 19:57 UTC
95 points
57 comments16 min readLW link

A simple case for extreme inner misalignment

Richard_Ngo13 Jul 2024 15:40 UTC
85 points
41 comments7 min readLW link

Don’t align agents to evaluations of plans

TurnTrout26 Nov 2022 21:16 UTC
45 points
49 comments18 min readLW link

On the Confusion between Inner and Outer Misalignment

Chris_Leong25 Mar 2024 11:59 UTC
17 points
10 comments1 min readLW link

Mesa-Optimizers via Grokking

orthonormal6 Dec 2022 20:05 UTC
36 points
4 comments6 min readLW link

Take 8: Queer the inner/outer alignment dichotomy.

Charlie Steiner9 Dec 2022 17:46 UTC
28 points
2 comments2 min readLW link

Reframing inner alignment

davidad11 Dec 2022 13:53 UTC
53 points
13 comments4 min readLW link

Categorizing failures as “outer” or “inner” misalignment is often confused

Rohin Shah6 Jan 2023 15:48 UTC
93 points
21 comments8 min readLW link

Model-based RL, Desires, Brains, Wireheading

Steven Byrnes14 Jul 2021 15:11 UTC
22 points
1 comment13 min readLW link

Some of my disagreements with List of Lethalities

TurnTrout24 Jan 2023 0:25 UTC
70 points
7 comments10 min readLW link

Re-Define Intent Alignment?

abramdemski22 Jul 2021 19:00 UTC
29 points
32 comments4 min readLW link

Applications for Deconfusing Goal-Directedness

adamShimi8 Aug 2021 13:05 UTC
38 points
3 comments5 min readLW link1 review

Approaches to gradient hacking

adamShimi14 Aug 2021 15:16 UTC
16 points
8 comments8 min readLW link

Inner Misalignment in “Simulator” LLMs

Adam Scherlis31 Jan 2023 8:33 UTC
84 points
12 comments4 min readLW link

Anomalous tokens reveal the original identities of Instruct models

9 Feb 2023 1:30 UTC
139 points
16 comments9 min readLW link
(generative.ink)

[Aspiration-based designs] 1. Informal introduction

28 Apr 2024 13:00 UTC
41 points
4 comments8 min readLW link

Selection Theorems: A Program For Understanding Agents

johnswentworth28 Sep 2021 5:03 UTC
123 points
28 comments6 min readLW link2 reviews

Understanding and controlling a maze-solving policy network

11 Mar 2023 18:59 UTC
328 points
27 comments23 min readLW link

[Question] Collection of arguments to expect (outer and inner) alignment failure?

Sam Clarke28 Sep 2021 16:55 UTC
21 points
10 comments1 min readLW link

A more systematic case for inner misalignment

Richard_Ngo20 Jul 2024 5:03 UTC
31 points
4 comments5 min readLW link

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

25 Jul 2024 22:00 UTC
59 points
8 comments2 min readLW link
(arxiv.org)

Framing approaches to alignment and the hard problem of AI cognition

ryan_greenblatt15 Dec 2021 19:06 UTC
16 points
15 comments27 min readLW link

If I were a well-intentioned AI… IV: Mesa-optimising

Stuart_Armstrong2 Mar 2020 12:16 UTC
26 points
2 comments6 min readLW link

Goals selected from learned knowledge: an alternative to RL alignment

Seth Herd15 Jan 2024 21:52 UTC
42 points
18 comments7 min readLW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda15 Dec 2021 23:44 UTC
127 points
9 comments15 min readLW link

My Overview of the AI Alignment Landscape: Threat Models

Neel Nanda25 Dec 2021 23:07 UTC
52 points
3 comments28 min readLW link

We have promising alignment plans with low taxes

Seth Herd10 Nov 2023 18:51 UTC
40 points
9 comments5 min readLW link

Difficulty classes for alignment properties

Jozdien20 Feb 2024 9:08 UTC
34 points
5 comments2 min readLW link

[Intro to brain-like-AGI safety] 10. The alignment problem

Steven Byrnes30 Mar 2022 13:24 UTC
48 points
7 comments19 min readLW link

AI Alignment 2018-19 Review

Rohin Shah28 Jan 2020 2:19 UTC
126 points
6 comments35 min readLW link

[Question] Why is pseudo-alignment “worse” than other ways ML can fail to generalize?

nostalgebraist18 Jul 2020 22:54 UTC
45 points
9 comments2 min readLW link

Goodhart’s Law Causal Diagrams

11 Apr 2022 13:52 UTC
34 points
5 comments6 min readLW link

Results from the Turing Seminar hackathon

7 Dec 2023 14:50 UTC
29 points
1 comment6 min readLW link

AXRP Episode 39 - Evan Hubinger on Model Organisms of Misalignment

DanielFilan1 Dec 2024 6:00 UTC
41 points
0 comments67 min readLW link

Language Agents Reduce the Risk of Existential Catastrophe

28 May 2023 19:10 UTC
39 points
14 comments26 min readLW link

Clarifying the confusion around inner alignment

Rauno Arike13 May 2022 23:05 UTC
31 points
0 comments11 min readLW link

Explaining inner alignment to myself

Jeremy Gillen24 May 2022 23:10 UTC
9 points
2 comments10 min readLW link

Clarifying mesa-optimization

21 Mar 2023 15:53 UTC
38 points
6 comments10 min readLW link

How to train your own “Sleeper Agents”

evhub7 Feb 2024 0:31 UTC
91 points
11 comments2 min readLW link

Why “AI alignment” would better be renamed into “Artificial Intention research”

chaosmage15 Jun 2023 10:32 UTC
29 points
12 comments2 min readLW link

Language for Goal Misgeneralization: Some Formalisms from my MSc Thesis

Giulio14 Jun 2024 19:35 UTC
4 points
0 comments8 min readLW link
(www.giuliostarace.com)

Inner alignment in the brain

Steven Byrnes22 Apr 2020 13:14 UTC
79 points
16 comments16 min readLW link

Winners of AI Alignment Awards Research Contest

13 Jul 2023 16:14 UTC
115 points
4 comments12 min readLW link
(alignmentawards.com)

Comparing Four Approaches to Inner Alignment

Lucas Teixeira29 Jul 2022 21:06 UTC
38 points
1 comment9 min readLW link

Towards an empirical investigation of inner alignment

evhub23 Sep 2019 20:43 UTC
44 points
9 comments6 min readLW link

AI Alignment Using Reverse Simulation

Sven Nilsen12 Jan 2021 20:48 UTC
0 points
0 comments1 min readLW link

Formal Solution to the Inner Alignment Problem

michaelcohen18 Feb 2021 14:51 UTC
49 points
123 comments2 min readLW link

Response to “What does the universal prior actually look like?”

michaelcohen20 May 2021 16:12 UTC
37 points
33 comments18 min readLW link

Insufficient Values

16 Jun 2021 14:33 UTC
31 points
16 comments6 min readLW link

Call for research on evaluating alignment (funding + advice available)

Beth Barnes31 Aug 2021 23:28 UTC
105 points
11 comments5 min readLW link

Obstacles to gradient hacking

leogao5 Sep 2021 22:42 UTC
28 points
11 comments4 min readLW link

Towards Deconfusing Gradient Hacking

leogao24 Oct 2021 0:43 UTC
39 points
3 comments12 min readLW link

Meta learning to gradient hack

Quintin Pope1 Oct 2021 19:25 UTC
55 points
11 comments3 min readLW link

The evaluation function of an AI is not its aim

Yair Halberstadt10 Oct 2021 14:52 UTC
13 points
5 comments3 min readLW link

[Question] What exactly is GPT-3’s base objective?

Daniel Kokotajlo10 Nov 2021 0:57 UTC
60 points
14 comments2 min readLW link

Understanding Gradient Hacking

peterbarnett10 Dec 2021 15:58 UTC
41 points
5 comments30 min readLW link

Evidence Sets: Towards Inductive-Biases based Analysis of Prosaic AGI

bayesian_kitten16 Dec 2021 22:41 UTC
22 points
10 comments21 min readLW link

Gradient Hacking via Schelling Goals

Adam Scherlis28 Dec 2021 20:38 UTC
33 points
4 comments4 min readLW link

Alignment Problems All the Way Down

peterbarnett22 Jan 2022 0:19 UTC
29 points
7 comments11 min readLW link

How complex are myopic imitators?

Vivek Hebbar8 Feb 2022 12:00 UTC
26 points
1 comment15 min readLW link

Project Intro: Selection Theorems for Modularity

4 Apr 2022 12:59 UTC
73 points
20 comments16 min readLW link

Deceptive Agents are a Good Way to Do Things

David Udell19 Apr 2022 18:04 UTC
16 points
0 comments1 min readLW link

Why No *Interesting* Unaligned Singularity?

David Udell20 Apr 2022 0:34 UTC
12 points
12 comments1 min readLW link

High-stakes alignment via adversarial training [Redwood Research report]

5 May 2022 0:59 UTC
142 points
29 comments9 min readLW link

AI Alternative Futures: Scenario Mapping Artificial Intelligence Risk—Request for Participation (*Closed*)

Kakili27 Apr 2022 22:07 UTC
10 points
2 comments8 min readLW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy12 May 2022 20:01 UTC
58 points
0 comments59 min readLW link

A Story of AI Risk: InstructGPT-N

peterbarnett26 May 2022 23:22 UTC
24 points
0 comments8 min readLW link

Why I’m Worried About AI

peterbarnett23 May 2022 21:13 UTC
22 points
2 comments12 min readLW link

Announcing the Inverse Scaling Prize ($250k Prize Pool)

27 Jun 2022 15:58 UTC
171 points
14 comments7 min readLW link

Doom doubts—is inner alignment a likely problem?

Crissman28 Jun 2022 12:42 UTC
6 points
7 comments1 min readLW link

The curious case of Pretty Good human inner/outer alignment

PavleMiha5 Jul 2022 19:04 UTC
41 points
45 comments4 min readLW link

Acceptability Verification: A Research Agenda

12 Jul 2022 20:11 UTC
50 points
0 comments1 min readLW link
(docs.google.com)

Conditioning Generative Models for Alignment

Jozdien18 Jul 2022 7:11 UTC
59 points
8 comments20 min readLW link

Our Existing Solutions to AGI Alignment (semi-safe)

Michael Soareverix21 Jul 2022 19:00 UTC
12 points
1 comment3 min readLW link

Incoherence of unbounded selfishness

emmab26 Jul 2022 22:27 UTC
−6 points
2 comments1 min readLW link

Externalized reasoning oversight: a research direction for language model alignment

tamera3 Aug 2022 12:03 UTC
130 points
23 comments6 min readLW link

Convergence Towards World-Models: A Gears-Level Model

Thane Ruthenis4 Aug 2022 23:31 UTC
38 points
1 comment13 min readLW link

Gradient descent doesn’t select for inner search

Ivan Vendrov13 Aug 2022 4:15 UTC
47 points
23 comments4 min readLW link

Deception as the optimal: mesa-optimizers and inner alignment

Eleni Angelou16 Aug 2022 4:49 UTC
11 points
0 comments5 min readLW link

Broad Picture of Human Values

Thane Ruthenis20 Aug 2022 19:42 UTC
42 points
6 comments10 min readLW link

Thoughts about OOD alignment

Catnee24 Aug 2022 15:31 UTC
11 points
10 comments2 min readLW link

Are Generative World Models a Mesa-Optimization Risk?

Thane Ruthenis29 Aug 2022 18:37 UTC
13 points
2 comments3 min readLW link

Three scenarios of pseudo-alignment

Eleni Angelou3 Sep 2022 12:47 UTC
9 points
0 comments3 min readLW link

Inner Alignment via Superpowers

30 Aug 2022 20:01 UTC
37 points
13 comments4 min readLW link

Framing AI Childhoods

David Udell6 Sep 2022 23:40 UTC
37 points
8 comments4 min readLW link

Can “Reward Economics” solve AI Alignment?

Q Home7 Sep 2022 7:58 UTC
3 points
15 comments18 min readLW link

The Defender’s Advantage of Interpretability

Marius Hobbhahn14 Sep 2022 14:05 UTC
41 points
4 comments6 min readLW link

Why deceptive alignment matters for AGI safety

Marius Hobbhahn15 Sep 2022 13:38 UTC
67 points
13 comments13 min readLW link

Levels of goals and alignment

zeshen16 Sep 2022 16:44 UTC
27 points
4 comments6 min readLW link

Inner alignment: what are we pointing at?

lemonhope18 Sep 2022 11:09 UTC
14 points
2 comments1 min readLW link

Planning capacity and daemons

lemonhope26 Sep 2022 0:15 UTC
2 points
0 comments5 min readLW link

LOVE in a simbox is all you need

jacob_cannell28 Sep 2022 18:25 UTC
64 points
72 comments44 min readLW link1 review

More examples of goal misgeneralization

7 Oct 2022 14:38 UTC
56 points
8 comments2 min readLW link
(deepmindsafetyresearch.medium.com)

Disentangling inner alignment failures

Erik Jenner10 Oct 2022 18:50 UTC
23 points
5 comments4 min readLW link

Greed Is the Root of This Evil

Thane Ruthenis13 Oct 2022 20:40 UTC
18 points
7 comments8 min readLW link

Science of Deep Learning—a technical agenda

Marius Hobbhahn18 Oct 2022 14:54 UTC
36 points
7 comments4 min readLW link

What sorts of systems can be deceptive?

Andrei Alexandru31 Oct 2022 22:00 UTC
16 points
0 comments7 min readLW link

Clarifying AI X-risk

1 Nov 2022 11:03 UTC
127 points
24 comments4 min readLW link1 review

Threat Model Literature Review

1 Nov 2022 11:03 UTC
77 points
4 comments25 min readLW link

[Question] I there a demo of “You can’t fetch the coffee if you’re dead”?

Ram Rachum10 Nov 2022 18:41 UTC
8 points
9 comments1 min readLW link

Value Formation: An Overarching Model

Thane Ruthenis15 Nov 2022 17:16 UTC
34 points
20 comments34 min readLW link

The Disastrously Confident And Inaccurate AI

Sharat Jacob Jacob18 Nov 2022 19:06 UTC
13 points
0 comments13 min readLW link

Aligned Behavior is not Evidence of Alignment Past a Certain Level of Intelligence

Ronny Fernandez5 Dec 2022 15:19 UTC
19 points
5 comments7 min readLW link

Disentangling Shard Theory into Atomic Claims

Leon Lang13 Jan 2023 4:23 UTC
86 points
6 comments18 min readLW link

In Defense of Wrapper-Minds

Thane Ruthenis28 Dec 2022 18:28 UTC
24 points
38 comments3 min readLW link

The Alignment Problems

Martín Soto12 Jan 2023 22:29 UTC
20 points
0 comments4 min readLW link

Gradient Filtering

18 Jan 2023 20:09 UTC
54 points
16 comments13 min readLW link

Gradient hacking is extremely difficult

beren24 Jan 2023 15:45 UTC
162 points
22 comments5 min readLW link

Medical Image Registration: The obscure field where Deep Mesaoptimizers are already at the top of the benchmarks. (post + colab notebook)

Hastings30 Jan 2023 22:46 UTC
34 points
1 comment3 min readLW link

The Linguistic Blind Spot of Value-Aligned Agency, Natural and Artificial

Roman Leventov14 Feb 2023 6:57 UTC
6 points
0 comments2 min readLW link
(arxiv.org)

Is there a ML agent that abandons it's utility function out-of-distribution without losing capabilities?

Christopher King22 Feb 2023 16:49 UTC
1 point
7 comments1 min readLW link

Refusal mechanisms: initial experiments with Llama-2-7b-chat

8 Dec 2023 17:08 UTC
81 points
7 comments7 min readLW link

A Kindness, or The Inevitable Consequence of Perfect Inference (a short story)

samhealy12 Dec 2023 23:03 UTC
6 points
0 comments9 min readLW link

Implementing Asimov’s Laws of Robotics—How I imagine alignment working.

Joshua Clancy22 May 2024 23:15 UTC
2 points
0 comments11 min readLW link

[Question] SAE sparse feature graph using only residual layers

Jaehyuk Lim23 May 2024 13:32 UTC
0 points
3 comments1 min readLW link

Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?

RogerDearnaley11 Jan 2024 12:56 UTC
34 points
4 comments39 min readLW link

Alignment in Thought Chains

Faust Nemesis4 Mar 2024 19:24 UTC
1 point
0 comments2 min readLW link

A conversation with Claude3 about its consciousness

rife5 Mar 2024 19:44 UTC
−4 points
3 comments1 min readLW link
(i.imgur.com)

A Review of Weak to Strong Generalization [AI Safety Camp]

sevdeawesome7 Mar 2024 17:16 UTC
13 points
0 comments9 min readLW link

Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI

26 Jan 2024 7:22 UTC
161 points
60 comments57 min readLW link

Finding Backward Chaining Circuits in Transformers Trained on Tree Search

28 May 2024 5:29 UTC
50 points
1 comment9 min readLW link
(arxiv.org)

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

kenneth myers9 Feb 2024 18:40 UTC
6 points
12 comments3 min readLW link

Thank you for triggering me

Cissy12 Feb 2024 20:09 UTC
5 points
1 comment6 min readLW link
(www.moremyself.xyz)

Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems

Florian_Dietz17 Feb 2024 8:45 UTC
4 points
0 comments13 min readLW link

Notes on Internal Objectives in Toy Models of Agents

Paul Colognese22 Feb 2024 8:02 UTC
16 points
0 comments8 min readLW link

The Inner Alignment Problem

Jakub Halmeš24 Feb 2024 17:55 UTC
1 point
1 comment3 min readLW link
(jakubhalmes.substack.com)

Invitation to the Princeton AI Alignment and Safety Seminar

Sadhika Malladi17 Mar 2024 1:10 UTC
6 points
1 comment1 min readLW link

Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM

Winnie Yang28 Aug 2024 8:41 UTC
7 points
2 comments31 min readLW link

Open-ended ethics of phenomena (a desiderata with universal morality)

Ryo 8 Nov 2023 20:10 UTC
1 point
0 comments8 min readLW link

Visualizing neural network planning

9 May 2024 6:40 UTC
4 points
0 comments5 min readLW link

Demystifying “Alignment” through a Comic

milanrosko9 Jun 2024 8:24 UTC
106 points
19 comments1 min readLW link

Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent

18 Jul 2024 17:02 UTC
9 points
0 comments1 min readLW link
(arxiv.org)

AI Rights for Human Safety

Simon Goldstein1 Aug 2024 23:01 UTC
45 points
6 comments1 min readLW link
(papers.ssrn.com)

[Question] What constitutes an infohazard?

K1r4d4rk.v18 Oct 2024 21:29 UTC
−4 points
8 comments1 min readLW link

Why humans won’t control superhuman AIs.

Spiritus Dei16 Oct 2024 16:48 UTC
−11 points
1 comment6 min readLW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjr12 Feb 2021 7:55 UTC
15 points
0 comments27 min readLW link

Aligned AI as a wrapper around an LLM

cousin_it25 Mar 2023 15:58 UTC
31 points
19 comments1 min readLW link

Are extrapolation-based AIs alignable?

cousin_it24 Mar 2023 15:55 UTC
22 points
15 comments1 min readLW link

Towards a solution to the alignment problem via objective detection and evaluation

Paul Colognese12 Apr 2023 15:39 UTC
9 points
7 comments12 min readLW link

Proposal: Using Monte Carlo tree search instead of RLHF for alignment research

Christopher King20 Apr 2023 19:57 UTC
2 points
7 comments3 min readLW link

A concise sum-up of the basic argument for AI doom

Mergimio H. Doefevmil24 Apr 2023 17:37 UTC
11 points
6 comments2 min readLW link

Archetypal Transfer Learning: a Proposed Alignment Solution that solves the Inner & Outer Alignment Problem while adding Corrigible Traits to GPT-2-medium

MiguelDev26 Apr 2023 1:37 UTC
14 points
5 comments10 min readLW link

AI Alignment: A Comprehensive Survey

Stephen McAleer1 Nov 2023 17:35 UTC
15 points
1 comment1 min readLW link
(arxiv.org)

Open-ended/Phenomenal Ethics (TLDR)

Ryo 9 Nov 2023 16:58 UTC
3 points
0 comments1 min readLW link

Optionality approach to ethics

Ryo 13 Nov 2023 15:23 UTC
7 points
2 comments3 min readLW link

Is Interpretability All We Need?

RogerDearnaley14 Nov 2023 5:31 UTC
1 point
1 comment1 min readLW link

Why small phenomenons are relevant to morality

Ryo 13 Nov 2023 15:25 UTC
1 point
0 comments3 min readLW link

Reaction to “Empowerment is (almost) All We Need”: an open-ended alternative

Ryo 25 Nov 2023 15:35 UTC
9 points
3 comments5 min readLW link

Alignment is Hard: An Uncomputable Alignment Problem

Alexander Bistagne19 Nov 2023 19:38 UTC
−5 points
4 comments1 min readLW link
(github.com)

Colour versus Shape Goal Misgeneralization in Reinforcement Learning: A Case Study

Karolis Jucys8 Dec 2023 13:18 UTC
13 points
1 comment4 min readLW link
(arxiv.org)

Taking Into Account Sentient Non-Humans in AI Ambitious Value Learning: Sentientist Coherent Extrapolated Volition

Adrià Moret2 Dec 2023 14:07 UTC
26 points
31 comments42 min readLW link

Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment

RogerDearnaley7 Dec 2023 6:14 UTC
9 points
0 comments11 min readLW link

A simple environment for showing mesa misalignment

Matthew Barnett26 Sep 2019 4:44 UTC
71 points
9 comments2 min readLW link

Babies and Bunnies: A Caution About Evo-Psych

Alicorn22 Feb 2010 1:53 UTC
81 points
843 comments2 min readLW link

2-D Robustness

Vlad Mikulik30 Aug 2019 20:27 UTC
85 points
8 comments2 min readLW link

Trying to measure AI deception capabilities using temporary simulation fine-tuning

alenoach4 May 2023 17:59 UTC
4 points
0 comments7 min readLW link

My preferred framings for reward misspecification and goal misgeneralisation

Yi-Yang6 May 2023 4:48 UTC
27 points
1 comment8 min readLW link

Is “red” for GPT-4 the same as “red” for you?

Yusuke Hayashi6 May 2023 17:55 UTC
9 points
6 comments2 min readLW link

Reward is the optimization target (of capabilities researchers)

Max H15 May 2023 3:22 UTC
32 points
4 comments5 min readLW link

Simple experiments with deceptive alignment

Andreas_Moe15 May 2023 17:41 UTC
7 points
0 comments4 min readLW link

The Goal Misgeneralization Problem

Myspy18 May 2023 23:40 UTC
1 point
0 comments1 min readLW link
(drive.google.com)

A Mechanistic Interpretability Analysis of a GridWorld Agent-Simulator (Part 1 of N)

Joseph Bloom16 May 2023 22:59 UTC
36 points
2 comments16 min readLW link

We Shouldn’t Expect AI to Ever be Fully Rational

OneManyNone18 May 2023 17:09 UTC
19 points
31 comments6 min readLW link

[Question] Is “brittle alignment” good enough?

the8thbit23 May 2023 17:35 UTC
9 points
5 comments3 min readLW link

Two ideas for alignment, perpetual mutual distrust and induction

APaleBlueDot25 May 2023 0:56 UTC
1 point
2 comments4 min readLW link

[AN #67]: Creating environments in which to study inner alignment failures

Rohin Shah7 Oct 2019 17:10 UTC
17 points
0 comments8 min readLW link
(mailchi.mp)

how humans are aligned

bhauth26 May 2023 0:09 UTC
14 points
3 comments1 min readLW link

Shutdown-Seeking AI

Simon Goldstein31 May 2023 22:19 UTC
50 points
32 comments15 min readLW link

How will they feed us

meijer19731 Jun 2023 8:49 UTC
4 points
3 comments5 min readLW link

Examples of AI’s behaving badly

Stuart_Armstrong16 Jul 2015 10:01 UTC
41 points
41 comments1 min readLW link

A Multidisciplinary Approach to Alignment (MATA) and Archetypal Transfer Learning (ATL)

MiguelDev19 Jun 2023 2:32 UTC
4 points
2 comments7 min readLW link

Localizing goal misgeneralization in a maze-solving policy network

jan betley6 Jul 2023 16:21 UTC
37 points
2 comments7 min readLW link

Safely and usefully spectating on AIs optimizing over toy worlds

AlexMennen31 Jul 2018 18:30 UTC
24 points
16 comments2 min readLW link

Simple alignment plan that maybe works

Iknownothing18 Jul 2023 22:48 UTC
4 points
8 comments1 min readLW link

Visible loss landscape basins don’t correspond to distinct algorithms

Mikhail Samin28 Jul 2023 16:19 UTC
68 points
13 comments4 min readLW link

Enhancing Corrigibility in AI Systems through Robust Feedback Loops

Justausername24 Aug 2023 3:53 UTC
1 point
0 comments6 min readLW link

Mesa-Optimization: Explain it like I’m 10 Edition

brook26 Aug 2023 23:04 UTC
20 points
1 comment6 min readLW link

“Inner Alignment Failures” Which Are Actually Outer Alignment Failures

johnswentworth31 Oct 2020 20:18 UTC
66 points
38 comments5 min readLW link

High-level interpretability: detecting an AI’s objectives

28 Sep 2023 19:30 UTC
69 points
4 comments21 min readLW link

A Case for AI Safety via Law

JWJohnston11 Sep 2023 18:26 UTC
17 points
12 comments4 min readLW link

Internal Target Information for AI Oversight

Paul Colognese20 Oct 2023 14:53 UTC
15 points
0 comments5 min readLW link

(Non-deceptive) Suboptimality Alignment

Sodium18 Oct 2023 2:07 UTC
5 points
1 comment9 min readLW link

Thoughts On (Solving) Deep Deception

Jozdien21 Oct 2023 22:40 UTC
69 points
4 comments6 min readLW link