
Inner Alignment

Last edit: 9 Oct 2023 23:35 UTC by Linda Linsefors

Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?

More specifically, inner alignment is the problem of ensuring that mesa-optimizers (i.e., trained ML systems that are themselves optimizers) are aligned with the objective function of the training process.

As an example, evolution is an optimization force that itself ‘designed’ optimizers (humans) to achieve its goals. However, humans do not primarily maximize reproductive success; instead, they use birth control while still attaining the pleasure that evolution meant as a reward for attempts at reproduction. This is a failure of inner alignment.

The term was first given a definition in the Hubinger et al. paper Risks from Learned Optimization:

We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.

Goal misgeneralization due to distribution shift is another example of an inner alignment failure: the mesa-optimizer appears to pursue the base objective during training but stops pursuing it during deployment. Good performance on the training distribution can mislead us into thinking the mesa-optimizer is pursuing the base objective, when in fact correlations in the training distribution made the base and mesa objectives agree. A distribution shift from training to deployment breaks that correlation, and the mesa-objective fails to generalize. This is especially problematic when the system's capabilities generalize to the deployment distribution while its objectives/goals do not, since we are then left with a capable system optimizing for a misaligned goal.
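
As a purely illustrative sketch (the corridor environment and `rollout` function below are hypothetical, loosely modeled on the coin-at-the-end examples from the goal misgeneralization literature, not taken from any of the posts listed here), a proxy goal of "walk to the end of the corridor" coincides with the base objective of "stop on the coin" during training, where the coin is always at the end, and then fails once the coin's position is randomized at deployment:

```python
# Purely illustrative sketch: a "trained" policy whose proxy goal
# ("walk to the last cell and stop there") coincides with the base
# objective ("stop on the coin's cell") only on the training distribution.
import random

CORRIDOR_LENGTH = 10  # cells 0..9


def rollout(coin_position: int) -> bool:
    """Run the learned policy and report whether the base objective was met.

    The policy's capability (navigation) works everywhere, but its goal is
    the proxy "go to the last cell", so it only ends up on the coin when the
    coin happens to be at the end of the corridor.
    """
    final_position = CORRIDOR_LENGTH - 1  # the policy always walks to the end
    return final_position == coin_position


# Training distribution: the coin is always at the end, so the proxy and base
# objectives are perfectly correlated and the policy looks aligned.
train = [rollout(coin_position=CORRIDOR_LENGTH - 1) for _ in range(1000)]
print("training success rate:", sum(train) / len(train))      # 1.0

# Deployment distribution: the coin is placed uniformly at random. The
# capability (walking the corridor) generalizes, but the goal does not.
deploy = [rollout(coin_position=random.randrange(CORRIDOR_LENGTH)) for _ in range(1000)]
print("deployment success rate:", sum(deploy) / len(deploy))  # roughly 0.1
```

A real failure would involve a learned policy rather than a hard-coded one, but the structure is the same: the training signal alone cannot distinguish the proxy goal from the intended one, so nothing pushes the learned objective toward the intended generalization.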

Solving the inner alignment problem would require progress on several sub-problems, including deceptive alignment, distribution shift, and gradient hacking.

Inner Alignment vs. Outer Alignment

Inner alignment is often treated as separate from outer alignment: the former is about guaranteeing that we are robustly aiming at something at all, while the latter is about what exactly we should be aiming at. For more information, see the corresponding tag.

Keep in mind that inner and outer alignment failures can occur together. The two are not a strict dichotomy, and even experienced alignment researchers often cannot tell them apart, which suggests that classifying failures under these terms is fuzzy. Ideally, rather than treating inner and outer alignment as a binary split to be tackled separately, we should take a more holistic view of alignment that accounts for the interplay between inner and outer alignment approaches.

Related Pages:

Mesa-Optimization, Treacherous Turn, Eliciting Latent Knowledge, Deceptive Alignment, Deception


The Inner Alignment Problem

4 Jun 2019 1:20 UTC
103 points
17 comments13 min readLW link

Risks from Learned Optimization: Introduction

31 May 2019 23:44 UTC
185 points
42 comments12 min readLW link3 reviews

Inner Alignment: Explain like I’m 12 Edition

Rafael Harth1 Aug 2020 15:24 UTC
181 points
47 comments13 min readLW link2 reviews

Demons in Imperfect Search

johnswentworth11 Feb 2020 20:25 UTC
107 points
21 comments3 min readLW link

Mesa-Search vs Mesa-Control

abramdemski18 Aug 2020 18:51 UTC
55 points
45 comments7 min readLW link

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley6 Jul 2024 1:23 UTC
58 points
39 comments24 min readLW link

How To Go From Interpretability To Alignment: Just Retarget The Search

johnswentworth10 Aug 2022 16:08 UTC
204 points
34 comments3 min readLW link1 review

Reward is not the optimization target

TurnTrout25 Jul 2022 0:03 UTC
376 points
123 comments10 min readLW link3 reviews

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley28 Nov 2023 19:56 UTC
64 points
30 comments11 min readLW link

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley5 Jan 2024 8:46 UTC
37 points
4 comments2 min readLW link

Why almost every RL agent does learned optimization

Lee Sharkey12 Feb 2023 4:58 UTC
32 points
3 comments5 min readLW link

Open question: are minimal circuits daemon-free?

paulfchristiano5 May 2018 22:40 UTC
83 points
70 comments2 min readLW link1 review

Matt Botvinick on the spontaneous emergence of learning algorithms

Adam Scholl12 Aug 2020 7:47 UTC
154 points
87 comments5 min readLW link

Searching for Search

28 Nov 2022 15:31 UTC
94 points
9 comments14 min readLW link1 review

Concrete experiments in inner alignment

evhub6 Sep 2019 22:16 UTC
74 points
12 comments6 min readLW link

Relaxed adversarial training for inner alignment

evhub10 Sep 2019 23:03 UTC
69 points
27 comments27 min readLW link

minutes from a human-alignment meeting

bhauth24 May 2024 5:01 UTC
66 points
4 comments2 min readLW link

Malign generalization without internal search

Matthew Barnett12 Jan 2020 18:03 UTC
43 points
12 comments4 min readLW link

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout2 Dec 2022 2:43 UTC
147 points
22 comments47 min readLW link3 reviews

Theoretical Neuroscience For Alignment Theory

Cameron Berg7 Dec 2021 21:50 UTC
65 points
18 comments23 min readLW link

Tessellating Hills: a toy model for demons in imperfect search

DaemonicSigil20 Feb 2020 0:12 UTC
97 points
18 comments2 min readLW link

Outer vs inner misalignment: three framings

Richard_Ngo6 Jul 2022 19:46 UTC
51 points
5 comments9 min readLW link

Empirical Observations of Objective Robustness Failures

23 Jun 2021 23:23 UTC
63 points
5 comments9 min readLW link

Discussion: Objective Robustness and Inner Alignment Terminology

23 Jun 2021 23:25 UTC
73 points
7 comments9 min readLW link

Question 2: Predicted bad outcomes of AGI learning architecture

Cameron Berg11 Feb 2022 22:23 UTC
5 points
1 comment10 min readLW link

Gradient hacking

evhub16 Oct 2019 0:53 UTC
106 points
39 comments3 min readLW link2 reviews

Are minimal circuits deceptive?

evhub7 Sep 2019 18:11 UTC
78 points
11 comments8 min readLW link

Book review: “A Thousand Brains” by Jeff Hawkins

Steven Byrnes4 Mar 2021 5:10 UTC
116 points
18 comments19 min readLW link

Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

Palus Astra1 Jul 2020 17:30 UTC
35 points
4 comments67 min readLW link

Inner alignment requires making assumptions about human values

Matthew Barnett20 Jan 2020 18:38 UTC
26 points
9 comments4 min readLW link

Defining capability and alignment in gradient descent

Edouard Harris5 Nov 2020 14:36 UTC
22 points
6 comments10 min readLW link

Does SGD Produce Deceptive Alignment?

Mark Xu6 Nov 2020 23:48 UTC
96 points
9 comments16 min readLW link

Steering subsystems: capabilities, agency, and alignment

Seth Herd29 Sep 2023 13:45 UTC
26 points
0 comments8 min readLW link

How likely is deceptive alignment?

evhub30 Aug 2022 19:34 UTC
103 points
28 comments60 min readLW link

An overview of 11 proposals for building safe advanced AI

evhub29 May 2020 20:38 UTC
213 points
36 comments38 min readLW link2 reviews

[Question] Does iterated amplification tackle the inner alignment problem?

JanB15 Feb 2020 12:58 UTC
7 points
4 comments1 min readLW link

Inner Alignment in Salt-Starved Rats

Steven Byrnes19 Nov 2020 2:40 UTC
137 points
41 comments11 min readLW link2 reviews

The (partial) fallacy of dumb superintelligence

Seth Herd18 Oct 2023 21:25 UTC
38 points
5 comments4 min readLW link

Mesa-Optimizers vs “Steered Optimizers”

Steven Byrnes10 Jul 2020 16:49 UTC
45 points
7 comments8 min readLW link

AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger

DanielFilan18 Feb 2021 0:03 UTC
43 points
10 comments87 min readLW link

Against evolution as an analogy for how humans will create AGI

Steven Byrnes23 Mar 2021 12:29 UTC
65 points
25 comments25 min readLW link

My AGI Threat Model: Misaligned Model-Based RL Agent

Steven Byrnes25 Mar 2021 13:45 UTC
74 points
40 comments16 min readLW link

Gradations of Inner Alignment Obstacles

abramdemski20 Apr 2021 22:18 UTC
81 points
22 comments9 min readLW link

Pre-Training + Fine-Tuning Favors Deception

Mark Xu8 May 2021 18:36 UTC
27 points
3 comments3 min readLW link

Formal Inner Alignment, Prospectus

abramdemski12 May 2021 19:57 UTC
95 points
57 comments16 min readLW link

A simple case for extreme inner misalignment

Richard_Ngo13 Jul 2024 15:40 UTC
85 points
41 comments7 min readLW link

Don’t align agents to evaluations of plans

TurnTrout26 Nov 2022 21:16 UTC
45 points
49 comments18 min readLW link

On the Confusion between Inner and Outer Misalignment

Chris_Leong25 Mar 2024 11:59 UTC
17 points
10 comments1 min readLW link

Mesa-Optimizers via Grokking

orthonormal6 Dec 2022 20:05 UTC
36 points
4 comments6 min readLW link

Take 8: Queer the inner/outer alignment dichotomy.

Charlie Steiner9 Dec 2022 17:46 UTC
28 points
2 comments2 min readLW link

Reframing inner alignment

davidad11 Dec 2022 13:53 UTC
53 points
13 comments4 min readLW link

Categorizing failures as “outer” or “inner” misalignment is often confused

Rohin Shah6 Jan 2023 15:48 UTC
93 points
21 comments8 min readLW link

Model-based RL, Desires, Brains, Wireheading

Steven Byrnes14 Jul 2021 15:11 UTC
22 points
1 comment13 min readLW link

Some of my disagreements with List of Lethalities

TurnTrout24 Jan 2023 0:25 UTC
70 points
7 comments10 min readLW link

Re-Define Intent Alignment?

abramdemski22 Jul 2021 19:00 UTC
29 points
32 comments4 min readLW link

Applications for Deconfusing Goal-Directedness

adamShimi8 Aug 2021 13:05 UTC
38 points
3 comments5 min readLW link1 review

Approaches to gradient hacking

adamShimi14 Aug 2021 15:16 UTC
16 points
8 comments8 min readLW link

Inner Misalignment in “Simulator” LLMs

Adam Scherlis31 Jan 2023 8:33 UTC
84 points
12 comments4 min readLW link

Anomalous tokens reveal the original identities of Instruct models

9 Feb 2023 1:30 UTC
139 points
16 comments9 min readLW link
(generative.ink)

[Aspiration-based designs] 1. Informal introduction

28 Apr 2024 13:00 UTC
41 points
4 comments8 min readLW link

Selection Theorems: A Program For Understanding Agents

johnswentworth28 Sep 2021 5:03 UTC
123 points
28 comments6 min readLW link2 reviews

Understanding and controlling a maze-solving policy network

11 Mar 2023 18:59 UTC
328 points
27 comments23 min readLW link

[Question] Collection of arguments to expect (outer and inner) alignment failure?

Sam Clarke28 Sep 2021 16:55 UTC
21 points
10 comments1 min readLW link

A more systematic case for inner misalignment

Richard_Ngo20 Jul 2024 5:03 UTC
31 points
4 comments5 min readLW link

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

25 Jul 2024 22:00 UTC
59 points
8 comments2 min readLW link
(arxiv.org)

Framing approaches to alignment and the hard problem of AI cognition

ryan_greenblatt15 Dec 2021 19:06 UTC
16 points
15 comments27 min readLW link

If I were a well-intentioned AI… IV: Mesa-optimising

Stuart_Armstrong2 Mar 2020 12:16 UTC
26 points
2 comments6 min readLW link

Goals selected from learned knowledge: an alternative to RL alignment

Seth Herd15 Jan 2024 21:52 UTC
42 points
18 comments7 min readLW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda15 Dec 2021 23:44 UTC
127 points
9 comments15 min readLW link

My Overview of the AI Alignment Landscape: Threat Models

Neel Nanda25 Dec 2021 23:07 UTC
52 points
3 comments28 min readLW link

We have promising alignment plans with low taxes

Seth Herd10 Nov 2023 18:51 UTC
40 points
9 comments5 min readLW link

Difficulty classes for alignment properties

Jozdien20 Feb 2024 9:08 UTC
34 points
5 comments2 min readLW link

[Intro to brain-like-AGI safety] 10. The alignment problem

Steven Byrnes30 Mar 2022 13:24 UTC
48 points
7 comments19 min readLW link

AI Alignment 2018-19 Review

Rohin Shah28 Jan 2020 2:19 UTC
126 points
6 comments35 min readLW link

[Question] Why is pseudo-alignment “worse” than other ways ML can fail to generalize?

nostalgebraist18 Jul 2020 22:54 UTC
45 points
9 comments2 min readLW link

Goodhart’s Law Causal Diagrams

11 Apr 2022 13:52 UTC
34 points
5 comments6 min readLW link

Results from the Turing Seminar hackathon

7 Dec 2023 14:50 UTC
29 points
1 comment6 min readLW link

AXRP Episode 39 - Evan Hubinger on Model Organisms of Misalignment

DanielFilan1 Dec 2024 6:00 UTC
41 points
0 comments67 min readLW link

Language Agents Reduce the Risk of Existential Catastrophe

28 May 2023 19:10 UTC
39 points
14 comments26 min readLW link

Clarifying the confusion around inner alignment

Rauno Arike13 May 2022 23:05 UTC
31 points
0 comments11 min readLW link

Explaining inner alignment to myself

Jeremy Gillen24 May 2022 23:10 UTC
9 points
2 comments10 min readLW link

Clarifying mesa-optimization

21 Mar 2023 15:53 UTC
38 points
6 comments10 min readLW link

How to train your own “Sleeper Agents”

evhub7 Feb 2024 0:31 UTC
91 points
11 comments2 min readLW link

Why “AI alignment” would better be renamed into “Artificial Intention research”

chaosmage15 Jun 2023 10:32 UTC
29 points
12 comments2 min readLW link

Language for Goal Misgeneralization: Some Formalisms from my MSc Thesis

Giulio14 Jun 2024 19:35 UTC
4 points
0 comments8 min readLW link
(www.giuliostarace.com)

Inner alignment in the brain

Steven Byrnes22 Apr 2020 13:14 UTC
79 points
16 comments16 min readLW link

Winners of AI Alignment Awards Research Contest

13 Jul 2023 16:14 UTC
115 points
4 comments12 min readLW link
(alignmentawards.com)

Comparing Four Approaches to Inner Alignment

Lucas Teixeira29 Jul 2022 21:06 UTC
38 points
1 comment9 min readLW link

Towards an empirical investigation of inner alignment

evhub23 Sep 2019 20:43 UTC
44 points
9 comments6 min readLW link

AI Alignment Using Reverse Simulation

Sven Nilsen12 Jan 2021 20:48 UTC
0 points
0 comments1 min readLW link

Formal Solution to the Inner Alignment Problem

michaelcohen18 Feb 2021 14:51 UTC
49 points
123 comments2 min readLW link

Response to “What does the universal prior actually look like?”

michaelcohen20 May 2021 16:12 UTC
37 points
33 comments18 min readLW link

Insufficient Values

16 Jun 2021 14:33 UTC
31 points
16 comments6 min readLW link

Call for research on evaluating alignment (funding + advice available)

Beth Barnes31 Aug 2021 23:28 UTC
105 points
11 comments5 min readLW link

Obstacles to gradient hacking

leogao5 Sep 2021 22:42 UTC
28 points
11 comments4 min readLW link

Towards Deconfusing Gradient Hacking

leogao24 Oct 2021 0:43 UTC
39 points
3 comments12 min readLW link

Meta learning to gradient hack

Quintin Pope1 Oct 2021 19:25 UTC
55 points
11 comments3 min readLW link

The evaluation function of an AI is not its aim

Yair Halberstadt10 Oct 2021 14:52 UTC
13 points
5 comments3 min readLW link

[Question] What exactly is GPT-3’s base objective?

Daniel Kokotajlo10 Nov 2021 0:57 UTC
60 points
14 comments2 min readLW link

Understanding Gradient Hacking

peterbarnett10 Dec 2021 15:58 UTC
41 points
5 comments30 min readLW link

Evidence Sets: Towards Inductive-Biases based Analysis of Prosaic AGI

bayesian_kitten16 Dec 2021 22:41 UTC
22 points
10 comments21 min readLW link

Gradient Hacking via Schelling Goals

Adam Scherlis28 Dec 2021 20:38 UTC
33 points
4 comments4 min readLW link

Alignment Problems All the Way Down

peterbarnett22 Jan 2022 0:19 UTC
29 points
7 comments11 min readLW link

How complex are myopic imitators?

Vivek Hebbar8 Feb 2022 12:00 UTC
26 points
1 comment15 min readLW link

Project Intro: Selection Theorems for Modularity

4 Apr 2022 12:59 UTC
73 points
20 comments16 min readLW link

Deceptive Agents are a Good Way to Do Things

David Udell19 Apr 2022 18:04 UTC
16 points
0 comments1 min readLW link

Why No *Interesting* Unaligned Singularity?

David Udell20 Apr 2022 0:34 UTC
12 points
12 comments1 min readLW link

High-stakes alignment via adversarial training [Redwood Research report]

5 May 2022 0:59 UTC
142 points
29 comments9 min readLW link

AI Alternative Futures: Scenario Mapping Artificial Intelligence Risk—Request for Participation (*Closed*)

Kakili27 Apr 2022 22:07 UTC
10 points
2 comments8 min readLW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy12 May 2022 20:01 UTC
58 points
0 comments59 min readLW link

A Story of AI Risk: InstructGPT-N

peterbarnett26 May 2022 23:22 UTC
24 points
0 comments8 min readLW link

Why I’m Worried About AI

peterbarnett23 May 2022 21:13 UTC
22 points
2 comments12 min readLW link

Announcing the Inverse Scaling Prize ($250k Prize Pool)

27 Jun 2022 15:58 UTC
171 points
14 comments7 min readLW link

Doom doubts—is inner alignment a likely problem?

Crissman28 Jun 2022 12:42 UTC
6 points
7 comments1 min readLW link

The curious case of Pretty Good human inner/outer alignment

PavleMiha5 Jul 2022 19:04 UTC
41 points
45 comments4 min readLW link

Acceptability Verification: A Research Agenda

12 Jul 2022 20:11 UTC
50 points
0 comments1 min readLW link
(docs.google.com)

Conditioning Generative Models for Alignment

Jozdien18 Jul 2022 7:11 UTC
59 points
8 comments20 min readLW link

Our Existing Solutions to AGI Alignment (semi-safe)

Michael Soareverix21 Jul 2022 19:00 UTC
12 points
1 comment3 min readLW link

Incoherence of unbounded selfishness

emmab26 Jul 2022 22:27 UTC
−6 points
2 comments1 min readLW link

Externalized reasoning oversight: a research direction for language model alignment

tamera3 Aug 2022 12:03 UTC
130 points
23 comments6 min readLW link

Convergence Towards World-Models: A Gears-Level Model

Thane Ruthenis4 Aug 2022 23:31 UTC
38 points
1 comment13 min readLW link

Gradient descent doesn’t select for inner search

Ivan Vendrov13 Aug 2022 4:15 UTC
47 points
23 comments4 min readLW link

Deception as the optimal: mesa-optimizers and inner alignment

Eleni Angelou16 Aug 2022 4:49 UTC
11 points
0 comments5 min readLW link

Broad Picture of Human Values

Thane Ruthenis20 Aug 2022 19:42 UTC
42 points
6 comments10 min readLW link

Thoughts about OOD alignment

Catnee24 Aug 2022 15:31 UTC
11 points
10 comments2 min readLW link

Are Generative World Models a Mesa-Optimization Risk?

Thane Ruthenis29 Aug 2022 18:37 UTC
13 points
2 comments3 min readLW link

Three scenarios of pseudo-alignment

Eleni Angelou3 Sep 2022 12:47 UTC
9 points
0 comments3 min readLW link

Inner Alignment via Superpowers

30 Aug 2022 20:01 UTC
37 points
13 comments4 min readLW link

Framing AI Childhoods

David Udell6 Sep 2022 23:40 UTC
37 points
8 comments4 min readLW link

Can “Reward Economics” solve AI Alignment?

Q Home7 Sep 2022 7:58 UTC
3 points
15 comments18 min readLW link

The Defender’s Advantage of Interpretability

Marius Hobbhahn14 Sep 2022 14:05 UTC
41 points
4 comments6 min readLW link

Why deceptive alignment matters for AGI safety

Marius Hobbhahn15 Sep 2022 13:38 UTC
67 points
13 comments13 min readLW link

Levels of goals and alignment

zeshen16 Sep 2022 16:44 UTC
27 points
4 comments6 min readLW link

Inner alignment: what are we pointing at?

lemonhope18 Sep 2022 11:09 UTC
14 points
2 comments1 min readLW link

Planning capacity and daemons

lemonhope26 Sep 2022 0:15 UTC
2 points
0 comments5 min readLW link

LOVE in a simbox is all you need

jacob_cannell28 Sep 2022 18:25 UTC
64 points
72 comments44 min readLW link1 review

More examples of goal misgeneralization

7 Oct 2022 14:38 UTC
56 points
8 comments2 min readLW link
(deepmindsafetyresearch.medium.com)

Disentangling inner alignment failures

Erik Jenner10 Oct 2022 18:50 UTC
23 points
5 comments4 min readLW link

Greed Is the Root of This Evil

Thane Ruthenis13 Oct 2022 20:40 UTC
18 points
7 comments8 min readLW link

Science of Deep Learning—a technical agenda

Marius Hobbhahn18 Oct 2022 14:54 UTC
36 points
7 comments4 min readLW link

What sorts of systems can be deceptive?

Andrei Alexandru31 Oct 2022 22:00 UTC
16 points
0 comments7 min readLW link

Clarifying AI X-risk

1 Nov 2022 11:03 UTC
127 points
24 comments4 min readLW link1 review

Threat Model Literature Review

1 Nov 2022 11:03 UTC
77 points
4 comments25 min readLW link

[Question] I there a demo of “You can’t fetch the coffee if you’re dead”?

Ram Rachum10 Nov 2022 18:41 UTC
8 points
9 comments1 min readLW link

Value Formation: An Overarching Model

Thane Ruthenis15 Nov 2022 17:16 UTC
34 points
20 comments34 min readLW link

The Disastrously Confident And Inaccurate AI

Sharat Jacob Jacob18 Nov 2022 19:06 UTC
13 points
0 comments13 min readLW link

Aligned Behavior is not Evidence of Alignment Past a Certain Level of Intelligence

Ronny Fernandez5 Dec 2022 15:19 UTC
19 points
5 comments7 min readLW link

Disentangling Shard Theory into Atomic Claims

Leon Lang13 Jan 2023 4:23 UTC
86 points
6 comments18 min readLW link

In Defense of Wrapper-Minds

Thane Ruthenis28 Dec 2022 18:28 UTC
24 points
38 comments3 min readLW link

The Alignment Problems

Martín Soto12 Jan 2023 22:29 UTC
20 points
0 comments4 min readLW link

Gradient Filtering

18 Jan 2023 20:09 UTC
54 points
16 comments13 min readLW link

Gradient hacking is extremely difficult

beren24 Jan 2023 15:45 UTC
162 points
22 comments5 min readLW link

Medical Image Registration: The obscure field where Deep Mesaoptimizers are already at the top of the benchmarks. (post + colab notebook)

Hastings30 Jan 2023 22:46 UTC
34 points
1 comment3 min readLW link

The Linguistic Blind Spot of Value-Aligned Agency, Natural and Artificial

Roman Leventov14 Feb 2023 6:57 UTC
6 points
0 comments2 min readLW link
(arxiv.org)

Is there a ML agent that abandons it's utility function out-of-distribution without losing capabilities?

Christopher King22 Feb 2023 16:49 UTC
1 point
7 comments1 min readLW link

Refusal mechanisms: initial experiments with Llama-2-7b-chat

8 Dec 2023 17:08 UTC
81 points
7 comments7 min readLW link

A Kindness, or The Inevitable Consequence of Perfect Inference (a short story)

samhealy12 Dec 2023 23:03 UTC
6 points
0 comments9 min readLW link

Implementing Asimov’s Laws of Robotics—How I imagine alignment working.

Joshua Clancy22 May 2024 23:15 UTC
2 points
0 comments11 min readLW link

[Question] SAE sparse feature graph using only residual layers

Jaehyuk Lim23 May 2024 13:32 UTC
0 points
3 comments1 min readLW link

Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?

RogerDearnaley11 Jan 2024 12:56 UTC
34 points
4 comments39 min readLW link

Alignment in Thought Chains

Faust Nemesis4 Mar 2024 19:24 UTC
1 point
0 comments2 min readLW link

A conversation with Claude3 about its consciousness

rife5 Mar 2024 19:44 UTC
−4 points
3 comments1 min readLW link
(i.imgur.com)

A Review of Weak to Strong Generalization [AI Safety Camp]

sevdeawesome7 Mar 2024 17:16 UTC
13 points
0 comments9 min readLW link

Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI

26 Jan 2024 7:22 UTC
161 points
60 comments57 min readLW link

Finding Backward Chaining Circuits in Transformers Trained on Tree Search

28 May 2024 5:29 UTC
50 points
1 comment9 min readLW link
(arxiv.org)

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

kenneth myers9 Feb 2024 18:40 UTC
6 points
12 comments3 min readLW link

Thank you for triggering me

Cissy12 Feb 2024 20:09 UTC
5 points
1 comment6 min readLW link
(www.moremyself.xyz)

Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems

Florian_Dietz17 Feb 2024 8:45 UTC
4 points
0 comments13 min readLW link

Notes on Internal Objectives in Toy Models of Agents

Paul Colognese22 Feb 2024 8:02 UTC
16 points
0 comments8 min readLW link

The Inner Alignment Problem

Jakub Halmeš24 Feb 2024 17:55 UTC
1 point
1 comment3 min readLW link
(jakubhalmes.substack.com)

Invitation to the Princeton AI Alignment and Safety Seminar

Sadhika Malladi17 Mar 2024 1:10 UTC
6 points
1 comment1 min readLW link

Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM

Winnie Yang28 Aug 2024 8:41 UTC
7 points
2 comments31 min readLW link

Open-ended ethics of phenomena (a desiderata with universal morality)

Ryo 8 Nov 2023 20:10 UTC
1 point
0 comments8 min readLW link

Visualizing neural network planning

9 May 2024 6:40 UTC
4 points
0 comments5 min readLW link

Demystifying “Alignment” through a Comic

milanrosko9 Jun 2024 8:24 UTC
106 points
19 comments1 min readLW link

Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent

18 Jul 2024 17:02 UTC
9 points
0 comments1 min readLW link
(arxiv.org)

AI Rights for Human Safety

Simon Goldstein1 Aug 2024 23:01 UTC
45 points
6 comments1 min readLW link
(papers.ssrn.com)

[Question] What constitutes an infohazard?

K1r4d4rk.v18 Oct 2024 21:29 UTC
−4 points
8 comments1 min readLW link

Why humans won’t control superhuman AIs.

Spiritus Dei16 Oct 2024 16:48 UTC
−11 points
1 comment6 min readLW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjr12 Feb 2021 7:55 UTC
15 points
0 comments27 min readLW link

Aligned AI as a wrapper around an LLM

cousin_it25 Mar 2023 15:58 UTC
31 points
19 comments1 min readLW link

Are extrapolation-based AIs alignable?

cousin_it24 Mar 2023 15:55 UTC
22 points
15 comments1 min readLW link

Towards a solution to the alignment problem via objective detection and evaluation

Paul Colognese12 Apr 2023 15:39 UTC
9 points
7 comments12 min readLW link

Proposal: Using Monte Carlo tree search instead of RLHF for alignment research

Christopher King20 Apr 2023 19:57 UTC
2 points
7 comments3 min readLW link

A concise sum-up of the basic argument for AI doom

Mergimio H. Doefevmil24 Apr 2023 17:37 UTC
11 points
6 comments2 min readLW link

Archetypal Transfer Learning: a Proposed Alignment Solution that solves the Inner & Outer Alignment Problem while adding Corrigible Traits to GPT-2-medium

MiguelDev26 Apr 2023 1:37 UTC
14 points
5 comments10 min readLW link

AI Alignment: A Comprehensive Survey

Stephen McAleer1 Nov 2023 17:35 UTC
15 points
1 comment1 min readLW link
(arxiv.org)

Open-ended/Phenomenal Ethics (TLDR)

Ryo 9 Nov 2023 16:58 UTC
3 points
0 comments1 min readLW link

Optionality approach to ethics

Ryo 13 Nov 2023 15:23 UTC
7 points
2 comments3 min readLW link

Is Interpretability All We Need?

RogerDearnaley14 Nov 2023 5:31 UTC
1 point
1 comment1 min readLW link

Why small phenomenons are relevant to morality

Ryo 13 Nov 2023 15:25 UTC
1 point
0 comments3 min readLW link

Reaction to “Empowerment is (almost) All We Need”: an open-ended alternative

Ryo 25 Nov 2023 15:35 UTC
9 points
3 comments5 min readLW link

Alignment is Hard: An Uncomputable Alignment Problem

Alexander Bistagne19 Nov 2023 19:38 UTC
−5 points
4 comments1 min readLW link
(github.com)

Colour versus Shape Goal Misgeneralization in Reinforcement Learning: A Case Study

Karolis Jucys8 Dec 2023 13:18 UTC
13 points
1 comment4 min readLW link
(arxiv.org)

Taking Into Account Sentient Non-Humans in AI Ambitious Value Learning: Sentientist Coherent Extrapolated Volition

Adrià Moret2 Dec 2023 14:07 UTC
26 points
31 comments42 min readLW link

Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment

RogerDearnaley7 Dec 2023 6:14 UTC
9 points
0 comments11 min readLW link

A simple environment for showing mesa misalignment

Matthew Barnett26 Sep 2019 4:44 UTC
71 points
9 comments2 min readLW link

Babies and Bunnies: A Caution About Evo-Psych

Alicorn22 Feb 2010 1:53 UTC
81 points
843 comments2 min readLW link

2-D Robustness

Vlad Mikulik30 Aug 2019 20:27 UTC
85 points
8 comments2 min readLW link

Trying to measure AI deception capabilities using temporary simulation fine-tuning

alenoach4 May 2023 17:59 UTC
4 points
0 comments7 min readLW link

My preferred framings for reward misspecification and goal misgeneralisation

Yi-Yang6 May 2023 4:48 UTC
27 points
1 comment8 min readLW link

Is “red” for GPT-4 the same as “red” for you?

Yusuke Hayashi6 May 2023 17:55 UTC
9 points
6 comments2 min readLW link

Reward is the optimization target (of capabilities researchers)

Max H15 May 2023 3:22 UTC
32 points
4 comments5 min readLW link

Simple experiments with deceptive alignment

Andreas_Moe15 May 2023 17:41 UTC
7 points
0 comments4 min readLW link

The Goal Misgeneralization Problem

Myspy18 May 2023 23:40 UTC
1 point
0 comments1 min readLW link
(drive.google.com)

A Mechanistic Interpretability Analysis of a GridWorld Agent-Simulator (Part 1 of N)

Joseph Bloom16 May 2023 22:59 UTC
36 points
2 comments16 min readLW link

We Shouldn’t Expect AI to Ever be Fully Rational

OneManyNone18 May 2023 17:09 UTC
19 points
31 comments6 min readLW link

[Question] Is “brittle alignment” good enough?

the8thbit23 May 2023 17:35 UTC
9 points
5 comments3 min readLW link

Two ideas for alignment, perpetual mutual distrust and induction

APaleBlueDot25 May 2023 0:56 UTC
1 point
2 comments4 min readLW link

[AN #67]: Creating environments in which to study inner alignment failures

Rohin Shah7 Oct 2019 17:10 UTC
17 points
0 comments8 min readLW link
(mailchi.mp)

how humans are aligned

bhauth26 May 2023 0:09 UTC
14 points
3 comments1 min readLW link

Shutdown-Seeking AI

Simon Goldstein31 May 2023 22:19 UTC
50 points
32 comments15 min readLW link

How will they feed us

meijer19731 Jun 2023 8:49 UTC
4 points
3 comments5 min readLW link

Examples of AI’s behaving badly

Stuart_Armstrong16 Jul 2015 10:01 UTC
41 points
41 comments1 min readLW link

A Multidisciplinary Approach to Alignment (MATA) and Archetypal Transfer Learning (ATL)

MiguelDev19 Jun 2023 2:32 UTC
4 points
2 comments7 min readLW link

Localizing goal misgeneralization in a maze-solving policy network

jan betley6 Jul 2023 16:21 UTC
37 points
2 comments7 min readLW link

Safely and usefully spectating on AIs optimizing over toy worlds

AlexMennen31 Jul 2018 18:30 UTC
24 points
16 comments2 min readLW link

Simple alignment plan that maybe works

Iknownothing18 Jul 2023 22:48 UTC
4 points
8 comments1 min readLW link

Visible loss landscape basins don’t correspond to distinct algorithms

Mikhail Samin28 Jul 2023 16:19 UTC
68 points
13 comments4 min readLW link

Enhancing Corrigibility in AI Systems through Robust Feedback Loops

Justausername24 Aug 2023 3:53 UTC
1 point
0 comments6 min readLW link

Mesa-Optimization: Explain it like I’m 10 Edition

brook26 Aug 2023 23:04 UTC
20 points
1 comment6 min readLW link

“Inner Alignment Failures” Which Are Actually Outer Alignment Failures

johnswentworth31 Oct 2020 20:18 UTC
66 points
38 comments5 min readLW link

High-level interpretability: detecting an AI’s objectives

28 Sep 2023 19:30 UTC
69 points
4 comments21 min readLW link

A Case for AI Safety via Law

JWJohnston11 Sep 2023 18:26 UTC
17 points
12 comments4 min readLW link

Internal Target Information for AI Oversight

Paul Colognese20 Oct 2023 14:53 UTC
15 points
0 comments5 min readLW link

(Non-deceptive) Suboptimality Alignment

Sodium18 Oct 2023 2:07 UTC
5 points
1 comment9 min readLW link

Thoughts On (Solving) Deep Deception

Jozdien21 Oct 2023 22:40 UTC
69 points
4 comments6 min readLW link