
Inner Alignment


Inner Alignment is the problem of ensuring that mesa-optimizers (i.e., trained ML systems that are themselves optimizers) are aligned with the objective function of the training process.

Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?
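
To make the mesa-optimizer framing concrete, below is a minimal, hypothetical sketch (the action set, the base objective, and the candidate internal objectives are illustrative assumptions, not taken from any paper): a base optimizer selects among policies, each of which is itself a small optimizer that searches over actions according to its own internal objective. Training reward alone cannot distinguish an internal objective that matches the base objective from a proxy that merely agrees with it on the training inputs.

```python
# Toy sketch of a base optimizer selecting over mesa-optimizers.
# All names and objectives here are illustrative assumptions.

ACTIONS = list(range(10))

def base_objective(action):
    """What the training process rewards: larger actions are better."""
    return action

def make_mesa_optimizer(mesa_objective):
    """A policy that is itself an optimizer: it searches over actions
    to maximize its own internal (mesa-) objective."""
    def policy():
        return max(ACTIONS, key=mesa_objective)
    return policy

# Candidate internal objectives the training process might produce.
candidates = {
    "aligned (same as base)":        lambda a: a,
    "proxy (agrees on this domain)": lambda a: -abs(a - 9),
    "misaligned":                    lambda a: a % 3,
}

# Crude "base optimization": score each candidate policy on the base objective.
for name, mesa_objective in candidates.items():
    action = make_mesa_optimizer(mesa_objective)()
    print(f"{name:32s} chooses action {action}, base reward {base_objective(action)}")
# The aligned and proxy objectives score identically here, so selecting on
# training reward alone cannot tell them apart -- the core of the inner
# alignment worry.
```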

As an example, evolution is an optimization force that itself ‘designed’ optimizers (humans) to achieve its goals. However, humans do not primarily maximize reproductive success; instead, they use birth control while still attaining the pleasure that evolution meant as a reward for attempts at reproduction. This is a failure of inner alignment.

The term was first given a definition in the Hubinger et al. paper Risks from Learned Optimization:

We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.

Goal misgeneralization due to distribution shift is another example of an inner alignment failure. It occurs when the mesa-optimizer appears to pursue the base objective during training but stops pursuing it during deployment. We mistakenly take good performance on the training distribution as evidence that the mesa-optimizer is pursuing the base objective, when in fact correlations in the training distribution made both the base and mesa objectives score well. When the distribution shifts from training to deployment, the correlation breaks and the mesa-objective fails to generalize. This is especially problematic when the system’s capabilities generalize to the deployment distribution while its objectives/goals do not, because we are then left with a capable system optimizing for a misaligned goal.
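
A minimal sketch of this failure mode, loosely modeled on the well-known CoinRun goal-misgeneralization example (the toy environment, policy names, and numbers below are illustrative assumptions): during training the coin always sits at the rightmost square, so a policy whose internal goal is “go right” earns the same reward as one whose goal is “get the coin”. When the coin’s position is randomized at deployment, the navigation capability still works, but the goal it serves no longer matches the base objective.

```python
# Toy sketch of goal misgeneralization under distribution shift.
# Environment, policies, and distributions are illustrative assumptions.
import random

WIDTH = 10  # squares 0..9; reaching the coin's square yields reward 1

def run_episode(coin_position, policy):
    return int(policy(coin_position) == coin_position)

# Two policies with different internal goals but identical training behavior:
go_right = lambda coin_position: WIDTH - 1       # mesa-goal: reach the far wall
get_coin = lambda coin_position: coin_position   # mesa-goal: reach the coin (the base objective)

def average_reward(policy, coin_positions):
    return sum(run_episode(c, policy) for c in coin_positions) / len(coin_positions)

train_coins  = [WIDTH - 1] * 100                              # training: coin always at the far right
deploy_coins = [random.randrange(WIDTH) for _ in range(100)]  # deployment: coin placed anywhere

for name, policy in [("go_right", go_right), ("get_coin", get_coin)]:
    print(name,
          "train:", average_reward(policy, train_coins),
          "deploy:", round(average_reward(policy, deploy_coins), 2))
# go_right scores 1.0 in training but only ~0.1 at deployment; get_coin scores 1.0 in both.
# The capability (navigating to a chosen square) generalized; the goal did not.
```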

Solving the inner alignment problem will require progress on sub-problems such as deceptive alignment, distribution shift, and gradient hacking.

Inner Alignment vs. Outer Alignment

Inner alignment is often discussed separately from outer alignment. The former deals with guaranteeing that we are robustly aiming at something at all, while the latter deals with what exactly we are aiming at. For more information, see the Outer Alignment tag.

Keep in mind that inner and outer alignment failures can occur together. They are not a strict dichotomy, and even experienced alignment researchers are often unable to tell them apart, which suggests that classifying failures along these lines is fuzzy. Ideally, rather than treating inner and outer alignment as two separate problems to be tackled individually, we should work with a more holistic picture of alignment that includes the interplay between the two.

Related Pages:

Mesa-Optimization, Treacherous Turn, Eliciting Latent Knowledge, Deceptive Alignment, Deception

External Links:

The In­ner Align­ment Problem

Jun 4, 2019, 1:20 AM
104 points
17 comments13 min readLW link

Risks from Learned Op­ti­miza­tion: Introduction

May 31, 2019, 11:44 PM
187 points
42 comments12 min readLW link3 reviews

In­ner Align­ment: Ex­plain like I’m 12 Edition

Rafael HarthAug 1, 2020, 3:24 PM
184 points
47 comments13 min readLW link2 reviews

De­mons in Im­perfect Search

johnswentworthFeb 11, 2020, 8:25 PM
110 points
21 comments3 min readLW link

Mesa-Search vs Mesa-Control

abramdemskiAug 18, 2020, 6:51 PM
55 points
45 comments7 min readLW link

How To Go From In­ter­pretabil­ity To Align­ment: Just Re­tar­get The Search

johnswentworthAug 10, 2022, 4:08 PM
209 points
34 comments3 min readLW link1 review

Re­ward is not the op­ti­miza­tion target

TurnTroutJul 25, 2022, 12:03 AM
375 points
123 comments10 min readLW link3 reviews

How to Con­trol an LLM’s Be­hav­ior (why my P(DOOM) went down)

RogerDearnaleyNov 28, 2023, 7:56 PM
64 points
30 comments11 min readLW link

Strik­ing Im­pli­ca­tions for Learn­ing The­ory, In­ter­pretabil­ity — and Safety?

RogerDearnaleyJan 5, 2024, 8:46 AM
37 points
4 comments2 min readLW link

min­utes from a hu­man-al­ign­ment meeting

bhauthMay 24, 2024, 5:01 AM
67 points
4 comments2 min readLW link

Re­laxed ad­ver­sar­ial train­ing for in­ner alignment

evhubSep 10, 2019, 11:03 PM
69 points
27 comments27 min readLW link

Why al­most ev­ery RL agent does learned optimization

Lee SharkeyFeb 12, 2023, 4:58 AM
32 points
3 comments5 min readLW link

Con­crete ex­per­i­ments in in­ner alignment

evhubSep 6, 2019, 10:16 PM
74 points
12 comments6 min readLW link

Open ques­tion: are min­i­mal cir­cuits dae­mon-free?

paulfchristianoMay 5, 2018, 10:40 PM
83 points
70 comments2 min readLW link1 review

Search­ing for Search

Nov 28, 2022, 3:31 PM
97 points
9 comments14 min readLW link1 review

A “Bit­ter Les­son” Ap­proach to Align­ing AGI and ASI

RogerDearnaleyJul 6, 2024, 1:23 AM
60 points
39 comments24 min readLW link

Matt Botv­inick on the spon­ta­neous emer­gence of learn­ing algorithms

Adam SchollAug 12, 2020, 7:47 AM
154 points
87 comments5 min readLW link

Book re­view: “A Thou­sand Brains” by Jeff Hawkins

Steven ByrnesMar 4, 2021, 5:10 AM
122 points
18 comments19 min readLW link

In­ner and outer al­ign­ment de­com­pose one hard prob­lem into two ex­tremely hard problems

TurnTroutDec 2, 2022, 2:43 AM
148 points
22 comments47 min readLW link3 reviews

Outer vs in­ner mis­al­ign­ment: three framings

Richard_NgoJul 6, 2022, 7:46 PM
51 points
5 comments9 min readLW link

Are min­i­mal cir­cuits de­cep­tive?

evhubSep 7, 2019, 6:11 PM
78 points
11 comments8 min readLW link

Tes­sel­lat­ing Hills: a toy model for demons in im­perfect search

DaemonicSigilFeb 20, 2020, 12:12 AM
97 points
18 comments2 min readLW link

Ques­tion 2: Pre­dicted bad out­comes of AGI learn­ing architecture

Cameron BergFeb 11, 2022, 10:23 PM
5 points
1 comment10 min readLW link

The­o­ret­i­cal Neu­ro­science For Align­ment Theory

Cameron BergDec 7, 2021, 9:50 PM
66 points
18 comments23 min readLW link

Em­piri­cal Ob­ser­va­tions of Ob­jec­tive Ro­bust­ness Failures

Jun 23, 2021, 11:23 PM
63 points
5 comments9 min readLW link

Dis­cus­sion: Ob­jec­tive Ro­bust­ness and In­ner Align­ment Terminology

Jun 23, 2021, 11:25 PM
73 points
7 comments9 min readLW link

Gra­di­ent hacking

evhubOct 16, 2019, 12:53 AM
107 points
39 comments3 min readLW link2 reviews

Mal­ign gen­er­al­iza­tion with­out in­ter­nal search

Matthew BarnettJan 12, 2020, 6:03 PM
43 points
12 comments4 min readLW link

Com­par­ing Four Ap­proaches to In­ner Alignment

Lucas TeixeiraJul 29, 2022, 9:06 PM
38 points
1 comment9 min readLW link

Clar­ify­ing mesa-optimization

Mar 21, 2023, 3:53 PM
38 points
6 comments10 min readLW link

Win­ners of AI Align­ment Awards Re­search Contest

Jul 13, 2023, 4:14 PM
115 points
4 comments12 min readLW link
(alignmentawards.com)

Lan­guage for Goal Mis­gen­er­al­iza­tion: Some For­mal­isms from my MSc Thesis

GiulioJun 14, 2024, 7:35 PM
6 points
0 comments8 min readLW link
(www.giuliostarace.com)

In­ner al­ign­ment in the brain

Steven ByrnesApr 22, 2020, 1:14 PM
79 points
16 comments16 min readLW link

How likely is de­cep­tive al­ign­ment?

evhubAug 30, 2022, 7:34 PM
104 points
28 comments60 min readLW link

Defin­ing ca­pa­bil­ity and al­ign­ment in gra­di­ent descent

Edouard HarrisNov 5, 2020, 2:36 PM
22 points
6 comments10 min readLW link

Does SGD Pro­duce De­cep­tive Align­ment?

Mark XuNov 6, 2020, 11:48 PM
96 points
9 comments16 min readLW link

Steer­ing sub­sys­tems: ca­pa­bil­ities, agency, and alignment

Seth HerdSep 29, 2023, 1:45 PM
31 points
0 comments8 min readLW link

Towards an em­piri­cal in­ves­ti­ga­tion of in­ner alignment

evhubSep 23, 2019, 8:43 PM
44 points
9 comments6 min readLW link

Evan Hub­inger on In­ner Align­ment, Outer Align­ment, and Pro­pos­als for Build­ing Safe Ad­vanced AI

Palus AstraJul 1, 2020, 5:30 PM
35 points
4 comments67 min readLW link

In­ner Align­ment in Salt-Starved Rats

Steven ByrnesNov 19, 2020, 2:40 AM
137 points
41 comments11 min readLW link2 reviews

The (par­tial) fal­lacy of dumb superintelligence

Seth HerdOct 18, 2023, 9:25 PM
38 points
5 comments4 min readLW link

In­ner al­ign­ment re­quires mak­ing as­sump­tions about hu­man values

Matthew BarnettJan 20, 2020, 6:38 PM
26 points
9 comments4 min readLW link

An overview of 11 pro­pos­als for build­ing safe ad­vanced AI

evhubMay 29, 2020, 8:38 PM
220 points
36 comments38 min readLW link2 reviews

[Question] Does iter­ated am­plifi­ca­tion tackle the in­ner al­ign­ment prob­lem?

JanBFeb 15, 2020, 12:58 PM
7 points
4 comments1 min readLW link

AXRP Epi­sode 4 - Risks from Learned Op­ti­miza­tion with Evan Hubinger

DanielFilanFeb 18, 2021, 12:03 AM
43 points
10 comments87 min readLW link

Against evolu­tion as an anal­ogy for how hu­mans will cre­ate AGI

Steven ByrnesMar 23, 2021, 12:29 PM
65 points
25 comments25 min readLW link

My AGI Threat Model: Misal­igned Model-Based RL Agent

Steven ByrnesMar 25, 2021, 1:45 PM
74 points
40 comments16 min readLW link

Gra­da­tions of In­ner Align­ment Obstacles

abramdemskiApr 20, 2021, 10:18 PM
84 points
22 comments9 min readLW link

Pre-Train­ing + Fine-Tun­ing Fa­vors Deception

Mark XuMay 8, 2021, 6:36 PM
27 points
3 comments3 min readLW link

Don’t al­ign agents to eval­u­a­tions of plans

TurnTroutNov 26, 2022, 9:16 PM
48 points
49 comments18 min readLW link

For­mal In­ner Align­ment, Prospectus

abramdemskiMay 12, 2021, 7:57 PM
95 points
57 comments16 min readLW link

Mesa-Op­ti­miz­ers via Grokking

orthonormalDec 6, 2022, 8:05 PM
36 points
4 comments6 min readLW link

Take 8: Queer the in­ner/​outer al­ign­ment di­chotomy.

Charlie SteinerDec 9, 2022, 5:46 PM
31 points
2 comments2 min readLW link

Refram­ing in­ner alignment

davidadDec 11, 2022, 1:53 PM
53 points
13 comments4 min readLW link

Mesa-Op­ti­miz­ers vs “Steered Op­ti­miz­ers”

Steven ByrnesJul 10, 2020, 4:49 PM
45 points
7 comments8 min readLW link

Cat­e­go­riz­ing failures as “outer” or “in­ner” mis­al­ign­ment is of­ten confused

Rohin ShahJan 6, 2023, 3:48 PM
93 points
21 comments8 min readLW link

[Aspira­tion-based de­signs] 1. In­for­mal in­tro­duc­tion

Apr 28, 2024, 1:00 PM
44 points
4 comments8 min readLW link

Model-based RL, De­sires, Brains, Wireheading

Steven ByrnesJul 14, 2021, 3:11 PM
22 points
1 comment13 min readLW link

Re-Define In­tent Align­ment?

abramdemskiJul 22, 2021, 7:00 PM
29 points
32 comments4 min readLW link

In­ner Misal­ign­ment in “Si­mu­la­tor” LLMs

Adam ScherlisJan 31, 2023, 8:33 AM
84 points
12 comments4 min readLW link

Ano­ma­lous to­kens re­veal the origi­nal iden­tities of In­struct models

Feb 9, 2023, 1:30 AM
139 points
16 comments9 min readLW link
(generative.ink)

Ap­pli­ca­tions for De­con­fus­ing Goal-Directedness

adamShimiAug 8, 2021, 1:05 PM
38 points
3 comments5 min readLW link1 review

Ap­proaches to gra­di­ent hacking

adamShimiAug 14, 2021, 3:16 PM
16 points
8 comments8 min readLW link

Un­der­stand­ing and con­trol­ling a maze-solv­ing policy network

Mar 11, 2023, 6:59 PM
332 points
28 comments23 min readLW link

Some of my dis­agree­ments with List of Lethalities

TurnTroutJan 24, 2023, 12:25 AM
70 points
7 comments10 min readLW link

Selec­tion The­o­rems: A Pro­gram For Un­der­stand­ing Agents

johnswentworthSep 28, 2021, 5:03 AM
128 points
28 comments6 min readLW link2 reviews

[Question] Col­lec­tion of ar­gu­ments to ex­pect (outer and in­ner) al­ign­ment failure?

Sam ClarkeSep 28, 2021, 4:55 PM
21 points
10 comments1 min readLW link

Fram­ing ap­proaches to al­ign­ment and the hard prob­lem of AI cognition

ryan_greenblattDec 15, 2021, 7:06 PM
16 points
15 comments27 min readLW link

My Overview of the AI Align­ment Land­scape: A Bird’s Eye View

Neel NandaDec 15, 2021, 11:44 PM
127 points
9 comments15 min readLW link

My Overview of the AI Align­ment Land­scape: Threat Models

Neel NandaDec 25, 2021, 11:07 PM
53 points
3 comments28 min readLW link

Goals se­lected from learned knowl­edge: an al­ter­na­tive to RL alignment

Seth HerdJan 15, 2024, 9:52 PM
42 points
18 comments7 min readLW link

If I were a well-in­ten­tioned AI… IV: Mesa-optimising

Stuart_ArmstrongMar 2, 2020, 12:16 PM
26 points
2 comments6 min readLW link

AXRP Epi­sode 39 - Evan Hub­inger on Model Or­ganisms of Misalignment

DanielFilanDec 1, 2024, 6:00 AM
41 points
0 comments67 min readLW link

A sim­ple case for ex­treme in­ner misalignment

Richard_NgoJul 13, 2024, 3:40 PM
84 points
41 comments7 min readLW link

Difficulty classes for al­ign­ment properties

JozdienFeb 20, 2024, 9:08 AM
34 points
5 comments2 min readLW link

[In­tro to brain-like-AGI safety] 10. The al­ign­ment problem

Steven ByrnesMar 30, 2022, 1:24 PM
48 points
7 comments19 min readLW link

We have promis­ing al­ign­ment plans with low taxes

Seth HerdNov 10, 2023, 6:51 PM
44 points
9 comments5 min readLW link

A more sys­tem­atic case for in­ner misalignment

Richard_NgoJul 20, 2024, 5:03 AM
31 points
4 comments5 min readLW link

[Question] Why is pseudo-al­ign­ment “worse” than other ways ML can fail to gen­er­al­ize?

nostalgebraistJul 18, 2020, 10:54 PM
45 points
9 comments2 min readLW link

Good­hart’s Law Causal Diagrams

Apr 11, 2022, 1:52 PM
34 points
5 comments6 min readLW link

Su­per­in­tel­li­gence’s goals are likely to be random

Mikhail SaminMar 13, 2025, 10:41 PM
3 points
6 comments5 min readLW link

AI Align­ment 2018-19 Review

Rohin ShahJan 28, 2020, 2:19 AM
126 points
6 comments35 min readLW link

Clar­ify­ing the con­fu­sion around in­ner alignment

Rauno ArikeMay 13, 2022, 11:05 PM
31 points
0 comments11 min readLW link

Ex­plain­ing in­ner al­ign­ment to myself

Jeremy GillenMay 24, 2022, 11:10 PM
9 points
2 comments10 min readLW link

Our new video about goal mis­gen­er­al­iza­tion, plus an apology

WriterJan 14, 2025, 2:07 PM
31 points
0 comments7 min readLW link
(youtu.be)

Lan­guage Agents Re­duce the Risk of Ex­is­ten­tial Catastrophe

May 28, 2023, 7:10 PM
39 points
14 comments26 min readLW link

How to train your own “Sleeper Agents”

evhubFeb 7, 2024, 12:31 AM
91 points
11 comments2 min readLW link

Pac­ing Out­side the Box: RNNs Learn to Plan in Sokoban

Jul 25, 2024, 10:00 PM
59 points
8 comments2 min readLW link
(arxiv.org)

Why “AI al­ign­ment” would bet­ter be re­named into “Ar­tifi­cial In­ten­tion re­search”

chaosmageJun 15, 2023, 10:32 AM
29 points
12 comments2 min readLW link

De­cep­tive Align­ment and Homuncularity

Jan 16, 2025, 1:55 PM
25 points
12 comments22 min readLW link

On the Con­fu­sion be­tween In­ner and Outer Misalignment

Chris_LeongMar 25, 2024, 11:59 AM
17 points
10 comments1 min readLW link

Ex­am­ples of AI’s be­hav­ing badly

Stuart_ArmstrongJul 16, 2015, 10:01 AM
41 points
41 comments1 min readLW link

A Mul­tidis­ci­plinary Ap­proach to Align­ment (MATA) and Archety­pal Trans­fer Learn­ing (ATL)

MiguelDevJun 19, 2023, 2:32 AM
4 points
2 comments7 min readLW link

Lo­cal­iz­ing goal mis­gen­er­al­iza­tion in a maze-solv­ing policy network

Jan BetleyJul 6, 2023, 4:21 PM
37 points
2 comments7 min readLW link

Safely and use­fully spec­tat­ing on AIs op­ti­miz­ing over toy worlds

AlexMennenJul 31, 2018, 6:30 PM
24 points
16 comments2 min readLW link

Sim­ple al­ign­ment plan that maybe works

IknownothingJul 18, 2023, 10:48 PM
4 points
8 comments1 min readLW link

Gra­di­ent de­scent might see the di­rec­tion of the op­ti­mum from far away

Mikhail SaminJul 28, 2023, 4:19 PM
70 points
13 comments4 min readLW link

En­hanc­ing Cor­rigi­bil­ity in AI Sys­tems through Ro­bust Feed­back Loops

JustausernameAug 24, 2023, 3:53 AM
1 point
0 comments6 min readLW link

Mesa-Op­ti­miza­tion: Ex­plain it like I’m 10 Edition

brookAug 26, 2023, 11:04 PM
20 points
1 comment6 min readLW link

“In­ner Align­ment Failures” Which Are Ac­tu­ally Outer Align­ment Failures

johnswentworthOct 31, 2020, 8:18 PM
66 points
38 comments5 min readLW link

High-level in­ter­pretabil­ity: de­tect­ing an AI’s objectives

Sep 28, 2023, 7:30 PM
71 points
4 comments21 min readLW link

A Case for AI Safety via Law

JWJohnstonSep 11, 2023, 6:26 PM
17 points
12 comments4 min readLW link

In­ter­nal Tar­get In­for­ma­tion for AI Oversight

Paul CologneseOct 20, 2023, 2:53 PM
15 points
0 comments5 min readLW link

(Non-de­cep­tive) Subop­ti­mal­ity Alignment

SodiumOct 18, 2023, 2:07 AM
5 points
1 comment9 min readLW link

Thoughts On (Solv­ing) Deep Deception

JozdienOct 21, 2023, 10:40 PM
71 points
6 comments6 min readLW link

AI Align­ment Us­ing Re­v­erse Simulation

Sven NilsenJan 12, 2021, 8:48 PM
0 points
0 comments1 min readLW link

For­mal Solu­tion to the In­ner Align­ment Problem

michaelcohenFeb 18, 2021, 2:51 PM
49 points
123 comments2 min readLW link

Re­sponse to “What does the uni­ver­sal prior ac­tu­ally look like?”

michaelcohenMay 20, 2021, 4:12 PM
37 points
33 comments18 min readLW link

In­suffi­cient Values

Jun 16, 2021, 2:33 PM
31 points
16 comments6 min readLW link

Call for re­search on eval­u­at­ing al­ign­ment (fund­ing + ad­vice available)

Beth BarnesAug 31, 2021, 11:28 PM
105 points
11 comments5 min readLW link

Ob­sta­cles to gra­di­ent hacking

leogaoSep 5, 2021, 10:42 PM
28 points
11 comments4 min readLW link

Towards De­con­fus­ing Gra­di­ent Hacking

leogaoOct 24, 2021, 12:43 AM
39 points
3 comments12 min readLW link

Meta learn­ing to gra­di­ent hack

Quintin PopeOct 1, 2021, 7:25 PM
55 points
11 comments3 min readLW link

The eval­u­a­tion func­tion of an AI is not its aim

Yair HalberstadtOct 10, 2021, 2:52 PM
13 points
5 comments3 min readLW link

[Question] What ex­actly is GPT-3′s base ob­jec­tive?

Daniel KokotajloNov 10, 2021, 12:57 AM
60 points
14 comments2 min readLW link

Un­der­stand­ing Gra­di­ent Hacking

peterbarnettDec 10, 2021, 3:58 PM
41 points
5 comments30 min readLW link

Ev­i­dence Sets: Towards In­duc­tive-Bi­ases based Anal­y­sis of Pro­saic AGI

bayesian_kittenDec 16, 2021, 10:41 PM
22 points
10 comments21 min readLW link

Gra­di­ent Hack­ing via Schel­ling Goals

Adam ScherlisDec 28, 2021, 8:38 PM
33 points
4 comments4 min readLW link

Align­ment Prob­lems All the Way Down

peterbarnettJan 22, 2022, 12:19 AM
29 points
7 comments11 min readLW link

How com­plex are my­opic imi­ta­tors?

Vivek HebbarFeb 8, 2022, 12:00 PM
26 points
1 comment15 min readLW link

Pro­ject In­tro: Selec­tion The­o­rems for Modularity

Apr 4, 2022, 12:59 PM
73 points
20 comments16 min readLW link

De­cep­tive Agents are a Good Way to Do Things

David UdellApr 19, 2022, 6:04 PM
16 points
0 comments1 min readLW link

Why No *In­ter­est­ing* Unal­igned Sin­gu­lar­ity?

David UdellApr 20, 2022, 12:34 AM
12 points
12 comments1 min readLW link

High-stakes al­ign­ment via ad­ver­sar­ial train­ing [Red­wood Re­search re­port]

May 5, 2022, 12:59 AM
142 points
29 comments9 min readLW link

AI Alter­na­tive Fu­tures: Sce­nario Map­ping Ar­tifi­cial In­tel­li­gence Risk—Re­quest for Par­ti­ci­pa­tion (*Closed*)

KakiliApr 27, 2022, 10:07 PM
10 points
2 comments8 min readLW link

In­ter­pretabil­ity’s Align­ment-Solv­ing Po­ten­tial: Anal­y­sis of 7 Scenarios

Evan R. MurphyMay 12, 2022, 8:01 PM
58 points
0 comments59 min readLW link

A Story of AI Risk: In­struc­tGPT-N

peterbarnettMay 26, 2022, 11:22 PM
24 points
0 comments8 min readLW link

Why I’m Wor­ried About AI

peterbarnettMay 23, 2022, 9:13 PM
22 points
2 comments12 min readLW link

An­nounc­ing the In­verse Scal­ing Prize ($250k Prize Pool)

Jun 27, 2022, 3:58 PM
171 points
14 comments7 min readLW link

Doom doubts—is in­ner al­ign­ment a likely prob­lem?

CrissmanJun 28, 2022, 12:42 PM
6 points
7 comments1 min readLW link

The cu­ri­ous case of Pretty Good hu­man in­ner/​outer alignment

PavleMihaJul 5, 2022, 7:04 PM
41 points
45 comments4 min readLW link

Ac­cept­abil­ity Ver­ifi­ca­tion: A Re­search Agenda

Jul 12, 2022, 8:11 PM
50 points
0 comments1 min readLW link
(docs.google.com)

Con­di­tion­ing Gen­er­a­tive Models for Alignment

JozdienJul 18, 2022, 7:11 AM
59 points
8 comments20 min readLW link

Our Ex­ist­ing Solu­tions to AGI Align­ment (semi-safe)

Michael SoareverixJul 21, 2022, 7:00 PM
12 points
1 comment3 min readLW link

In­co­her­ence of un­bounded selfishness

emmabJul 26, 2022, 10:27 PM
−6 points
2 comments1 min readLW link

Ex­ter­nal­ized rea­son­ing over­sight: a re­search di­rec­tion for lan­guage model alignment

tameraAug 3, 2022, 12:03 PM
135 points
23 comments6 min readLW link

Con­ver­gence Towards World-Models: A Gears-Level Model

Thane RuthenisAug 4, 2022, 11:31 PM
38 points
1 comment13 min readLW link

Gra­di­ent de­scent doesn’t se­lect for in­ner search

Ivan VendrovAug 13, 2022, 4:15 AM
47 points
23 comments4 min readLW link

De­cep­tion as the op­ti­mal: mesa-op­ti­miz­ers and in­ner al­ign­ment

Eleni AngelouAug 16, 2022, 4:49 AM
11 points
0 comments5 min readLW link

Broad Pic­ture of Hu­man Values

Thane RuthenisAug 20, 2022, 7:42 PM
42 points
6 comments10 min readLW link

Thoughts about OOD alignment

CatneeAug 24, 2022, 3:31 PM
11 points
10 comments2 min readLW link

Are Gen­er­a­tive World Models a Mesa-Op­ti­miza­tion Risk?

Thane RuthenisAug 29, 2022, 6:37 PM
13 points
2 comments3 min readLW link

Three sce­nar­ios of pseudo-al­ign­ment

Eleni AngelouSep 3, 2022, 12:47 PM
9 points
0 comments3 min readLW link

In­ner Align­ment via Superpowers

Aug 30, 2022, 8:01 PM
37 points
13 comments4 min readLW link

Fram­ing AI Childhoods

David UdellSep 6, 2022, 11:40 PM
37 points
8 comments4 min readLW link

Can “Re­ward Eco­nomics” solve AI Align­ment?

Q HomeSep 7, 2022, 7:58 AM
3 points
15 comments18 min readLW link

The Defen­der’s Ad­van­tage of Interpretability

Marius HobbhahnSep 14, 2022, 2:05 PM
41 points
4 comments6 min readLW link

Why de­cep­tive al­ign­ment mat­ters for AGI safety

Marius HobbhahnSep 15, 2022, 1:38 PM
68 points
13 comments13 min readLW link

Levels of goals and alignment

zeshenSep 16, 2022, 4:44 PM
27 points
4 comments6 min readLW link

In­ner al­ign­ment: what are we point­ing at?

lemonhopeSep 18, 2022, 11:09 AM
14 points
2 comments1 min readLW link

Plan­ning ca­pac­ity and daemons

lemonhopeSep 26, 2022, 12:15 AM
2 points
0 comments5 min readLW link

LOVE in a sim­box is all you need

jacob_cannellSep 28, 2022, 6:25 PM
66 points
73 comments44 min readLW link1 review

More ex­am­ples of goal misgeneralization

Oct 7, 2022, 2:38 PM
56 points
8 comments2 min readLW link
(deepmindsafetyresearch.medium.com)

Disen­tan­gling in­ner al­ign­ment failures

Erik JennerOct 10, 2022, 6:50 PM
23 points
5 comments4 min readLW link

Greed Is the Root of This Evil

Thane RuthenisOct 13, 2022, 8:40 PM
21 points
7 comments8 min readLW link

Science of Deep Learn­ing—a tech­ni­cal agenda

Marius HobbhahnOct 18, 2022, 2:54 PM
37 points
7 comments4 min readLW link

What sorts of sys­tems can be de­cep­tive?

Andrei AlexandruOct 31, 2022, 10:00 PM
16 points
0 comments7 min readLW link

Clar­ify­ing AI X-risk

Nov 1, 2022, 11:03 AM
127 points
24 comments4 min readLW link1 review

Threat Model Liter­a­ture Review

Nov 1, 2022, 11:03 AM
78 points
4 comments25 min readLW link

[Question] I there a demo of “You can’t fetch the coffee if you’re dead”?

Ram RachumNov 10, 2022, 6:41 PM
8 points
9 comments1 min readLW link

Value For­ma­tion: An Over­ar­ch­ing Model

Thane RuthenisNov 15, 2022, 5:16 PM
34 points
20 comments34 min readLW link

The Disas­trously Con­fi­dent And Inac­cu­rate AI

Sharat Jacob JacobNov 18, 2022, 7:06 PM
13 points
0 comments13 min readLW link

Aligned Be­hav­ior is not Ev­i­dence of Align­ment Past a Cer­tain Level of Intelligence

Ronny FernandezDec 5, 2022, 3:19 PM
19 points
5 comments7 min readLW link

Disen­tan­gling Shard The­ory into Atomic Claims

Leon LangJan 13, 2023, 4:23 AM
86 points
6 comments18 min readLW link

In Defense of Wrap­per-Minds

Thane RuthenisDec 28, 2022, 6:28 PM
24 points
38 comments3 min readLW link

The Align­ment Problems

Martín SotoJan 12, 2023, 10:29 PM
20 points
0 comments4 min readLW link

Gra­di­ent Filtering

Jan 18, 2023, 8:09 PM
55 points
16 comments13 min readLW link

Gra­di­ent hack­ing is ex­tremely difficult

berenJan 24, 2023, 3:45 PM
162 points
22 comments5 min readLW link

Med­i­cal Image Regis­tra­tion: The ob­scure field where Deep Me­saop­ti­miz­ers are already at the top of the bench­marks. (post + co­lab note­book)

HastingsJan 30, 2023, 10:46 PM
35 points
1 comment3 min readLW link

The Lin­guis­tic Blind Spot of Value-Aligned Agency, Nat­u­ral and Ar­tifi­cial

Roman LeventovFeb 14, 2023, 6:57 AM
6 points
0 comments2 min readLW link
(arxiv.org)

Is there a ML agent that aban­dons it’s util­ity func­tion out-of-dis­tri­bu­tion with­out los­ing ca­pa­bil­ities?

Christopher KingFeb 22, 2023, 4:49 PM
1 point
7 comments1 min readLW link

Re­fusal mechanisms: ini­tial ex­per­i­ments with Llama-2-7b-chat

Dec 8, 2023, 5:08 PM
81 points
7 comments7 min readLW link

A Kind­ness, or The Inevitable Con­se­quence of Perfect In­fer­ence (a short story)

samhealyDec 12, 2023, 11:03 PM
6 points
0 comments9 min readLW link

Im­ple­ment­ing Asi­mov’s Laws of Robotics—How I imag­ine al­ign­ment work­ing.

Joshua ClancyMay 22, 2024, 11:15 PM
2 points
0 comments11 min readLW link

[Question] SAE sparse fea­ture graph us­ing only resi­d­ual layers

Jaehyuk LimMay 23, 2024, 1:32 PM
0 points
3 comments1 min readLW link

Mo­ti­vat­ing Align­ment of LLM-Pow­ered Agents: Easy for AGI, Hard for ASI?

RogerDearnaleyJan 11, 2024, 12:56 PM
35 points
4 comments39 min readLW link

Align­ment in Thought Chains

Faust NemesisMar 4, 2024, 7:24 PM
1 point
0 comments2 min readLW link

A Re­view of Weak to Strong Gen­er­al­iza­tion [AI Safety Camp]

sevdeawesomeMar 7, 2024, 5:16 PM
13 points
0 comments9 min readLW link

Without fun­da­men­tal ad­vances, mis­al­ign­ment and catas­tro­phe are the de­fault out­comes of train­ing pow­er­ful AI

Jan 26, 2024, 7:22 AM
161 points
60 comments57 min readLW link

Find­ing Back­ward Chain­ing Cir­cuits in Trans­form­ers Trained on Tree Search

May 28, 2024, 5:29 AM
50 points
1 comment9 min readLW link
(arxiv.org)

The Ideal Speech Si­tu­a­tion as a Tool for AI Eth­i­cal Reflec­tion: A Frame­work for Alignment

kenneth myersFeb 9, 2024, 6:40 PM
6 points
12 comments3 min readLW link

Thank you for trig­ger­ing me

CissyFeb 12, 2024, 8:09 PM
6 points
1 comment6 min readLW link
(www.moremyself.xyz)

Achiev­ing AI Align­ment through De­liber­ate Uncer­tainty in Mul­ti­a­gent Systems

Florian_DietzFeb 17, 2024, 8:45 AM
4 points
0 comments13 min readLW link

Notes on In­ter­nal Ob­jec­tives in Toy Models of Agents

Paul CologneseFeb 22, 2024, 8:02 AM
16 points
0 comments8 min readLW link

The In­ner Align­ment Problem

Jakub HalmešFeb 24, 2024, 5:55 PM
1 point
1 comment3 min readLW link
(jakubhalmes.substack.com)

In­vi­ta­tion to the Prince­ton AI Align­ment and Safety Seminar

Sadhika MalladiMar 17, 2024, 1:10 AM
6 points
1 comment1 min readLW link

De­cep­tion and Jailbreak Se­quence: 2. Iter­a­tive Refine­ment Stages of Jailbreaks in LLM

Winnie YangAug 28, 2024, 8:41 AM
7 points
2 comments31 min readLW link

Open-ended ethics of phe­nom­ena (a desider­ata with uni­ver­sal moral­ity)

Ryo Nov 8, 2023, 8:10 PM
1 point
0 comments8 min readLW link

Re­cur­sive Cog­ni­tive Refine­ment (RCR): A Self-Cor­rect­ing Ap­proach for LLM Hallucinations

mxTheoFeb 22, 2025, 9:32 PM
0 points
0 comments2 min readLW link

Vi­su­al­iz­ing neu­ral net­work planning

May 9, 2024, 6:40 AM
4 points
0 comments5 min readLW link

De­mys­tify­ing “Align­ment” through a Comic

milanroskoJun 9, 2024, 8:24 AM
106 points
19 comments1 min readLW link

In­ter­pretabil­ity in Ac­tion: Ex­plo­ra­tory Anal­y­sis of VPT, a Minecraft Agent

Jul 18, 2024, 5:02 PM
9 points
0 comments1 min readLW link
(arxiv.org)

Unal­igned AGI & Brief His­tory of Inequality

ankFeb 22, 2025, 4:26 PM
−20 points
4 comments7 min readLW link

[Question] Does hu­man (mis)al­ign­ment pose a sig­nifi­cant and im­mi­nent ex­is­ten­tial threat?

jrFeb 23, 2025, 10:03 AM
6 points
3 comments1 min readLW link

Mo­ral gauge the­ory: A spec­u­la­tive sug­ges­tion for AI alignment

James DiacoumisFeb 23, 2025, 11:42 AM
4 points
2 comments8 min readLW link

AI Rights for Hu­man Safety

Simon GoldsteinAug 1, 2024, 11:01 PM
45 points
6 comments1 min readLW link
(papers.ssrn.com)

[Question] What con­sti­tutes an in­fo­haz­ard?

K1r4d4rk.v1Oct 8, 2024, 9:29 PM
−4 points
8 comments1 min readLW link

Why hu­mans won’t con­trol su­per­hu­man AIs.

Spiritus DeiOct 16, 2024, 4:48 PM
−11 points
1 comment6 min readLW link

Propos­ing Hu­man Sur­vival Strat­egy based on the NAIA Vi­sion: Toward the Co-evolu­tion of Di­verse Intelligences

Hiroshi YamakawaFeb 27, 2025, 5:18 AM
−2 points
0 comments11 min readLW link

Map­ping the Con­cep­tual Ter­ri­tory in AI Ex­is­ten­tial Safety and Alignment

jbkjrFeb 12, 2021, 7:55 AM
15 points
0 comments27 min readLW link

Iden­tity Align­ment (IA) in AI

Davey MorseMar 3, 2025, 6:26 AM
1 point
4 comments1 min readLW link

I Recom­mend More Train­ing Rationales

Gianluca CalcagniDec 31, 2024, 2:06 PM
2 points
0 comments6 min readLW link

The AI Agent Revolu­tion: Beyond the Hype of 2025

DimaGJan 2, 2025, 6:55 PM
−7 points
1 comment28 min readLW link

The Hid­den Cost of Our Lies to AI

Nicholas AndresenMar 6, 2025, 5:03 AM
109 points
12 comments7 min readLW link
(substack.com)

Split Per­son­al­ity Train­ing: Re­veal­ing La­tent Knowl­edge Through Per­son­al­ity-Shift Tokens

Florian_DietzMar 10, 2025, 4:07 PM
35 points
3 comments9 min readLW link

PRISM: Per­spec­tive Rea­son­ing for In­te­grated Syn­the­sis and Me­di­a­tion (In­ter­ac­tive Demo)

Anthony DiamondMar 18, 2025, 6:03 PM
1 point
0 comments1 min readLW link

“Pick Two” AI Trilemma: Gen­er­al­ity, Agency, Align­ment.

Black FlagJan 15, 2025, 6:52 PM
7 points
0 comments2 min readLW link

What are the plans for solv­ing the in­ner al­ign­ment prob­lem?

Leonard HollowayJan 17, 2025, 9:45 PM
12 points
4 comments1 min readLW link

The Road to Evil Is Paved with Good Ob­jec­tives: Frame­work to Clas­sify and Fix Misal­ign­ments.

ShivamJan 30, 2025, 2:44 AM
1 point
0 comments11 min readLW link

Tether­ware #1: The case for hu­man­like AI with free will

Jáchym FibírJan 30, 2025, 10:58 AM
5 points
11 comments10 min readLW link
(tetherware.substack.com)

Aligned AI as a wrap­per around an LLM

cousin_itMar 25, 2023, 3:58 PM
31 points
19 comments1 min readLW link

Are ex­trap­o­la­tion-based AIs al­ignable?

cousin_itMar 24, 2023, 3:55 PM
22 points
15 comments1 min readLW link

Towards a solu­tion to the al­ign­ment prob­lem via ob­jec­tive de­tec­tion and eval­u­a­tion

Paul CologneseApr 12, 2023, 3:39 PM
9 points
7 comments12 min readLW link

Pro­posal: Us­ing Monte Carlo tree search in­stead of RLHF for al­ign­ment research

Christopher KingApr 20, 2023, 7:57 PM
2 points
7 comments3 min readLW link

A con­cise sum-up of the ba­sic ar­gu­ment for AI doom

Mergimio H. DoefevmilApr 24, 2023, 5:37 PM
11 points
6 comments2 min readLW link

Archety­pal Trans­fer Learn­ing: a Pro­posed Align­ment Solu­tion that solves the In­ner & Outer Align­ment Prob­lem while adding Cor­rigible Traits to GPT-2-medium

MiguelDevApr 26, 2023, 1:37 AM
14 points
5 comments10 min readLW link

AI Align­ment: A Com­pre­hen­sive Survey

Stephen McAleerNov 1, 2023, 5:35 PM
20 points
1 comment1 min readLW link
(arxiv.org)

​​ Open-ended/​Phenom­e­nal ​Ethics ​(TLDR)

Ryo Nov 9, 2023, 4:58 PM
3 points
0 comments1 min readLW link

Op­tion­al­ity ap­proach to ethics

Ryo Nov 13, 2023, 3:23 PM
7 points
2 comments3 min readLW link

Is In­ter­pretabil­ity All We Need?

RogerDearnaleyNov 14, 2023, 5:31 AM
1 point
1 comment1 min readLW link

Why small phe­nomenons are rele­vant to moral­ity ​

Ryo Nov 13, 2023, 3:25 PM
1 point
0 comments3 min readLW link

Re­ac­tion to “Em­pow­er­ment is (al­most) All We Need” : an open-ended alternative

Ryo Nov 25, 2023, 3:35 PM
9 points
3 comments5 min readLW link

Align­ment is Hard: An Un­com­putable Align­ment Problem

Alexander BistagneNov 19, 2023, 7:38 PM
−5 points
4 comments1 min readLW link
(github.com)

Colour ver­sus Shape Goal Mis­gen­er­al­iza­tion in Re­in­force­ment Learn­ing: A Case Study

Karolis JucysDec 8, 2023, 1:18 PM
13 points
1 comment4 min readLW link
(arxiv.org)

Tak­ing Into Ac­count Sen­tient Non-Hu­mans in AI Am­bi­tious Value Learn­ing: Sen­tien­tist Co­her­ent Ex­trap­o­lated Volition

Adrià MoretDec 2, 2023, 2:07 PM
26 points
31 comments42 min readLW link

Lan­guage Model Me­moriza­tion, Copy­right Law, and Con­di­tional Pre­train­ing Alignment

RogerDearnaleyDec 7, 2023, 6:14 AM
9 points
0 comments11 min readLW link

A sim­ple en­vi­ron­ment for show­ing mesa misalignment

Matthew BarnettSep 26, 2019, 4:44 AM
73 points
9 comments2 min readLW link

Ba­bies and Bun­nies: A Cau­tion About Evo-Psych

AlicornFeb 22, 2010, 1:53 AM
81 points
843 comments2 min readLW link

2-D Robustness

Vlad MikulikAug 30, 2019, 8:27 PM
85 points
8 comments2 min readLW link

Try­ing to mea­sure AI de­cep­tion ca­pa­bil­ities us­ing tem­po­rary simu­la­tion fine-tuning

alenoachMay 4, 2023, 5:59 PM
4 points
0 comments7 min readLW link

My preferred fram­ings for re­ward mis­speci­fi­ca­tion and goal misgeneralisation

Yi-YangMay 6, 2023, 4:48 AM
27 points
1 comment8 min readLW link

Is “red” for GPT-4 the same as “red” for you?

Yusuke HayashiMay 6, 2023, 5:55 PM
9 points
6 comments2 min readLW link

Re­ward is the op­ti­miza­tion tar­get (of ca­pa­bil­ities re­searchers)

Max HMay 15, 2023, 3:22 AM
32 points
4 comments5 min readLW link

Sim­ple ex­per­i­ments with de­cep­tive alignment

Andreas_MoeMay 15, 2023, 5:41 PM
7 points
0 comments4 min readLW link

The Goal Mis­gen­er­al­iza­tion Problem

MyspyMay 18, 2023, 11:40 PM
1 point
0 comments1 min readLW link
(drive.google.com)

A Mechanis­tic In­ter­pretabil­ity Anal­y­sis of a GridWorld Agent-Si­mu­la­tor (Part 1 of N)

Joseph BloomMay 16, 2023, 10:59 PM
36 points
2 comments16 min readLW link

We Shouldn’t Ex­pect AI to Ever be Fully Rational

OneManyNoneMay 18, 2023, 5:09 PM
19 points
31 comments6 min readLW link

[Question] Is “brit­tle al­ign­ment” good enough?

the8thbitMay 23, 2023, 5:35 PM
9 points
5 comments3 min readLW link

Two ideas for al­ign­ment, per­pet­ual mu­tual dis­trust and induction

APaleBlueDotMay 25, 2023, 12:56 AM
1 point
2 comments4 min readLW link

[AN #67]: Creat­ing en­vi­ron­ments in which to study in­ner al­ign­ment failures

Rohin ShahOct 7, 2019, 5:10 PM
17 points
0 comments8 min readLW link
(mailchi.mp)

how hu­mans are aligned

bhauthMay 26, 2023, 12:09 AM
14 points
3 comments1 min readLW link

Shut­down-Seek­ing AI

Simon GoldsteinMay 31, 2023, 10:19 PM
50 points
32 comments15 min readLW link

How will they feed us

meijer1973Jun 1, 2023, 8:49 AM
4 points
3 comments5 min readLW link