
Mesa-Optimization

Tag. Last edit: 19 Mar 2023 20:15 UTC by Diabloto96

Mesa-Optimization occurs when a learned model (such as a neural network) is itself an optimizer. In this situation, a base optimizer creates a second optimizer, called a mesa-optimizer. The primary reference for this concept is Hubinger et al.’s “Risks from Learned Optimization in Advanced Machine Learning Systems”.

Example: Natural selection is an optimization process that optimizes for reproductive fitness. Natural selection produced humans, who are themselves optimizers. Humans are therefore mesa-optimizers of natural selection.

In the context of AI alignment, the concern is that a base optimizer (e.g., a gradient descent process) may produce a learned model that is itself an optimizer, and that has unexpected and undesirable properties. Even if the gradient descent process is in some sense “trying” to do exactly what human developers want, the resultant mesa-optimizer will not typically be trying to do the exact same thing.[1]
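
As a rough illustration of this concern, here is a toy Python sketch. It is not taken from Hubinger et al., and every name in it (`Env`, `make_mesa_optimizer`, `base_optimizer`, and so on) is hypothetical. The base optimizer selects among candidate models by their training score, and each candidate is itself a small optimizer that searches over positions to maximize its own internal objective. Because a marker always coincides with the goal during training, the base optimizer cannot tell a goal-seeking model from a marker-seeking one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Env:
    goal: int    # the position the base objective rewards reaching
    marker: int  # a feature that happens to coincide with the goal in training


Model = Callable[[Env], int]
Objective = Callable[[Env, int], float]

POSITIONS = range(10)  # the positions a model may choose (toy assumption)


def base_objective(env: Env, position: int) -> float:
    """What the base optimizer scores: closeness to the goal."""
    return -abs(position - env.goal)


def make_mesa_optimizer(mesa_objective: Objective) -> Model:
    """Build a model that runs its own search at inference time."""
    def model(env: Env) -> int:
        # Inner optimization: pick the position that is best under the
        # model's own (mesa-)objective, not the base objective.
        return max(POSITIONS, key=lambda pos: mesa_objective(env, pos))
    return model


def seek_marker(env: Env, pos: int) -> float:
    return -abs(pos - env.marker)


def seek_goal(env: Env, pos: int) -> float:
    return -abs(pos - env.goal)


def seek_origin(env: Env, pos: int) -> float:
    return -pos


CANDIDATES: Dict[str, Model] = {
    "seeks_marker": make_mesa_optimizer(seek_marker),
    "seeks_goal": make_mesa_optimizer(seek_goal),
    "seeks_origin": make_mesa_optimizer(seek_origin),
}


def training_score(model: Model, train_envs: List[Env]) -> float:
    return sum(base_objective(env, model(env)) for env in train_envs)


def base_optimizer(candidates: Dict[str, Model], train_envs: List[Env]) -> str:
    """Outer optimization: keep the candidate with the best training score.

    Ties go to the first candidate in dictionary order, an arbitrary
    stand-in for whatever inductive bias would break the tie in practice.
    """
    return max(candidates, key=lambda name: training_score(candidates[name], train_envs))


# In training, the marker always sits on the goal.
TRAIN_ENVS = [Env(goal=g, marker=g) for g in (2, 5, 8)]
chosen = base_optimizer(CANDIDATES, TRAIN_ENVS)

# Off-distribution, the marker and the goal come apart.
TEST_ENV = Env(goal=7, marker=1)
test_position = CANDIDATES[chosen](TEST_ENV)
test_score = base_objective(TEST_ENV, test_position)

print(chosen, test_position, test_score)  # prints: seeks_marker 1 -6
```

Under these assumptions the selected model scores perfectly in training, yet at test time it competently pursues the marker and ends up six steps from the goal. The training signal alone never distinguished the two objectives.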

History

Earlier work on this concept used the terms “inner optimizer” and “optimization daemons”.

Wei Dai brings up a similar idea in an SL4 thread.[2]

The optimization daemons article on Arbital was probably published in 2016.[1]

Jessica Taylor wrote two posts about daemons while at MIRI:

See also

External links

Video by Robert Miles

Some posts that reference optimization daemons:

  1. ^
  2. ^ Wei Dai. ‘“friendly” humans?’ December 31, 2003.

Risks from Learned Optimization: Introduction

31 May 2019 23:44 UTC
185 points
42 comments, 12 min read, LW link, 3 reviews

Matt Botvinick on the spontaneous emergence of learning algorithms

Adam Scholl, 12 Aug 2020 7:47 UTC
154 points
87 comments, 5 min read, LW link

Embedded Agency (full-text version)

15 Nov 2018 19:49 UTC
201 points
17 comments, 54 min read, LW link

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

25 Jul 2024 22:00 UTC
59 points
8 comments, 2 min read, LW link
(arxiv.org)

Mesa-Search vs Mesa-Control

abramdemski, 18 Aug 2020 18:51 UTC
55 points
45 comments, 7 min read, LW link

Conditions for Mesa-Optimization

1 Jun 2019 20:52 UTC
84 points
48 comments, 12 min read, LW link

Searching for Search

28 Nov 2022 15:31 UTC
94 points
9 comments, 14 min read, LW link, 1 review

Trying to Make a Treacherous Mesa-Optimizer

MadHatter, 9 Nov 2022 18:07 UTC
95 points
14 comments, 4 min read, LW link
(attentionspan.blog)

Why almost every RL agent does learned optimization

Lee Sharkey, 12 Feb 2023 4:58 UTC
32 points
3 comments, 5 min read, LW link

Subsystem Alignment

6 Nov 2018 16:16 UTC
99 points
12 comments, 1 min read, LW link

Is evolutionary influence the mesa objective that we’re interested in?

David Johnston, 3 May 2022 1:18 UTC
3 points
2 comments, 5 min read, LW link

Why GPT wants to mesa-optimize & how we might change this

John_Maxwell, 19 Sep 2020 13:48 UTC
55 points
33 comments, 9 min read, LW link

Prize for probable problems

paulfchristiano, 8 Mar 2018 16:58 UTC
60 points
63 comments, 4 min read, LW link

How much should we worry about mesa-optimization challenges?

sudo, 25 Jul 2022 3:56 UTC
4 points
13 comments, 2 min read, LW link

Defining capability and alignment in gradient descent

Edouard Harris, 5 Nov 2020 14:36 UTC
22 points
6 comments, 10 min read, LW link

Satisficers want to become maximisers

Stuart_Armstrong, 21 Oct 2011 16:27 UTC
38 points
70 comments, 1 min read, LW link

Mesa-optimization for goals defined only within a training environment is dangerous

Rubi J. Hudson, 17 Aug 2022 3:56 UTC
6 points
2 comments, 4 min read, LW link

AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger

DanielFilan, 18 Feb 2021 0:03 UTC
43 points
10 comments, 87 min read, LW link

Formal Solution to the Inner Alignment Problem

michaelcohen, 18 Feb 2021 14:51 UTC
49 points
123 comments, 2 min read, LW link

Does SGD Produce Deceptive Alignment?

Mark Xu, 6 Nov 2020 23:48 UTC
96 points
9 comments, 16 min read, LW link

Thoughts on safety in predictive learning

Steven Byrnes, 30 Jun 2021 19:17 UTC
19 points
17 comments, 19 min read, LW link

Utility ≠ Reward

Vlad Mikulik, 5 Sep 2019 17:28 UTC
130 points
24 comments, 1 min read, LW link, 2 reviews

Garrabrant and Shah on human modeling in AGI

Rob Bensinger, 4 Aug 2021 4:35 UTC
60 points
10 comments, 47 min read, LW link

Approaches to gradient hacking

adamShimi, 14 Aug 2021 15:16 UTC
16 points
8 comments, 8 min read, LW link

Consequentialism is in the Stars not Ourselves

DragonGod, 24 Apr 2023 0:02 UTC
7 points
19 comments, 5 min read, LW link

Mesa-Optimizers vs “Steered Optimizers”

Steven Byrnes, 10 Jul 2020 16:49 UTC
45 points
7 comments, 8 min read, LW link

Open question: are minimal circuits daemon-free?

paulfchristiano, 5 May 2018 22:40 UTC
83 points
70 comments, 2 min read, LW link, 1 review

Meta learning to gradient hack

Quintin Pope, 1 Oct 2021 19:25 UTC
55 points
11 comments, 3 min read, LW link

Modeling Risks From Learned Optimization

Ben Cottier, 12 Oct 2021 20:54 UTC
45 points
0 comments, 12 min read, LW link

Mesa-Optimizers via Grokking

orthonormal, 6 Dec 2022 20:05 UTC
36 points
4 comments, 6 min read, LW link

[Question] What specific dangers arise when asking GPT-N to write an Alignment Forum post?

Matthew Barnett, 28 Jul 2020 2:56 UTC
45 points
14 comments, 1 min read, LW link

Mlyyrczo

lsusr, 26 Dec 2022 7:58 UTC
41 points
14 comments, 3 min read, LW link

Feature Selection

Zack_M_Davis, 1 Nov 2021 0:22 UTC
319 points
24 comments, 16 min read, LW link, 1 review

Counting arguments provide no evidence for AI doom

27 Feb 2024 23:03 UTC
94 points
188 comments, 14 min read, LW link

Clarifying mesa-optimization

21 Mar 2023 15:53 UTC
38 points
6 comments, 10 min read, LW link

Inner Alignment: Explain like I’m 12 Edition

Rafael Harth, 1 Aug 2020 15:24 UTC
181 points
47 comments, 13 min read, LW link, 2 reviews

If I were a well-intentioned AI… IV: Mesa-optimising

Stuart_Armstrong, 2 Mar 2020 12:16 UTC
26 points
2 comments, 6 min read, LW link

Anomalous tokens reveal the original identities of Instruct models

9 Feb 2023 1:30 UTC
139 points
16 comments, 9 min read, LW link
(generative.ink)

Principled Satisficing To Avoid Goodhart

JenniferRM, 16 Aug 2024 19:05 UTC
45 points
2 comments, 8 min read, LW link

Thoughts on gradient hacking

Richard_Ngo, 3 Sep 2021 13:02 UTC
33 points
11 comments, 4 min read, LW link

[ASoT] Some thoughts about deceptive mesaoptimization

leogao, 28 Mar 2022 21:14 UTC
24 points
5 comments, 7 min read, LW link

[ASoT] Some thoughts about imperfect world modeling

leogao, 7 Apr 2022 15:42 UTC
7 points
0 comments, 4 min read, LW link

[Question] Three questions about mesa-optimizers

Eric Neyman, 12 Apr 2022 2:58 UTC
24 points
5 comments, 3 min read, LW link

[Question] Why is pseudo-alignment “worse” than other ways ML can fail to generalize?

nostalgebraist, 18 Jul 2020 22:54 UTC
45 points
9 comments, 2 min read, LW link

Risks from Learned Optimization: Conclusion and Related Work

7 Jun 2019 19:53 UTC
82 points
5 comments, 6 min read, LW link

Deceptive Alignment

5 Jun 2019 20:16 UTC
118 points
20 comments, 17 min read, LW link

The Inner Alignment Problem

4 Jun 2019 1:20 UTC
103 points
17 comments, 13 min read, LW link

The Speed + Simplicity Prior is probably anti-deceptive

Yonadav Shavit, 27 Apr 2022 19:30 UTC
28 points
28 comments, 12 min read, LW link

Agency As a Natural Abstraction

Thane Ruthenis, 13 May 2022 18:02 UTC
55 points
9 comments, 13 min read, LW link

A Story of AI Risk: InstructGPT-N

peterbarnett, 26 May 2022 23:22 UTC
24 points
0 comments, 8 min read, LW link

Towards Gears-Level Understanding of Agency

Thane Ruthenis, 16 Jun 2022 22:00 UTC
25 points
4 comments, 18 min read, LW link

Goal Alignment Is Robust To the Sharp Left Turn

Thane Ruthenis, 13 Jul 2022 20:23 UTC
42 points
16 comments, 4 min read, LW link

Convergence Towards World-Models: A Gears-Level Model

Thane Ruthenis, 4 Aug 2022 23:31 UTC
38 points
1 comment, 13 min read, LW link

Gradient descent doesn’t select for inner search

Ivan Vendrov, 13 Aug 2022 4:15 UTC
47 points
23 comments, 4 min read, LW link

Deception as the optimal: mesa-optimizers and inner alignment

Eleni Angelou, 16 Aug 2022 4:49 UTC
11 points
0 comments, 5 min read, LW link

Interpretability Tools Are an Attack Channel

Thane Ruthenis, 17 Aug 2022 18:47 UTC
42 points
14 comments, 1 min read, LW link

Broad Picture of Human Values

Thane Ruthenis, 20 Aug 2022 19:42 UTC
42 points
6 comments, 10 min read, LW link

Are Generative World Models a Mesa-Optimization Risk?

Thane Ruthenis, 29 Aug 2022 18:37 UTC
13 points
2 comments, 3 min read, LW link

Inner alignment: what are we pointing at?

lemonhope, 18 Sep 2022 11:09 UTC
14 points
2 comments, 1 min read, LW link

Towards deconfusing wireheading and reward maximization

leogao, 21 Sep 2022 0:36 UTC
81 points
7 comments, 4 min read, LW link

Planning capacity and daemons

lemonhope, 26 Sep 2022 0:15 UTC
2 points
0 comments, 5 min read, LW link

Greed Is the Root of This Evil

Thane Ruthenis, 13 Oct 2022 20:40 UTC
18 points
7 comments, 8 min read, LW link

My (naive) take on Risks from Learned Optimization

Artyom Karpov, 31 Oct 2022 10:59 UTC
7 points
0 comments, 5 min read, LW link

Value Formation: An Overarching Model

Thane Ruthenis, 15 Nov 2022 17:16 UTC
34 points
20 comments, 34 min read, LW link

Caution when interpreting Deepmind’s In-context RL paper

Sam Marks, 1 Nov 2022 2:42 UTC
105 points
8 comments, 4 min read, LW link

The Disastrously Confident And Inaccurate AI

Sharat Jacob Jacob, 18 Nov 2022 19:06 UTC
13 points
0 comments, 13 min read, LW link

In Defense of Wrapper-Minds

Thane Ruthenis, 28 Dec 2022 18:28 UTC
24 points
38 comments, 3 min read, LW link

Gradient Filtering

18 Jan 2023 20:09 UTC
54 points
16 comments, 13 min read, LW link

Gradient hacking is extremely difficult

beren, 24 Jan 2023 15:45 UTC
162 points
22 comments, 5 min read, LW link

Against Boltzmann mesaoptimizers

porby, 30 Jan 2023 2:55 UTC
76 points
6 comments, 4 min read, LW link

Medical Image Registration: The obscure field where Deep Mesaoptimizers are already at the top of the benchmarks. (post + colab notebook)

Hastings, 30 Jan 2023 22:46 UTC
34 points
1 comment, 3 min read, LW link

Powerful mesa-optimisation is already here

Roman Leventov, 17 Feb 2023 4:59 UTC
35 points
1 comment, 2 min read, LW link
(arxiv.org)

Finding Backward Chaining Circuits in Transformers Trained on Tree Search

28 May 2024 5:29 UTC
50 points
1 comment, 9 min read, LW link
(arxiv.org)

The Inner Alignment Problem

Jakub Halmeš, 24 Feb 2024 17:55 UTC
1 point
1 comment, 3 min read, LW link
(jakubhalmes.substack.com)

Understanding mesa-optimization using toy models

7 May 2023 17:00 UTC
43 points
2 comments, 10 min read, LW link

Measuring Learned Optimization in Small Transformer Models

J Bostock, 8 Apr 2024 14:41 UTC
22 points
0 comments, 11 min read, LW link

Visualizing neural network planning

9 May 2024 6:40 UTC
4 points
0 comments, 5 min read, LW link

The Human’s Role in Mesa Optimization

silentbob, 9 May 2024 12:07 UTC
5 points
0 comments, 2 min read, LW link

Inner Optimization Mechanisms in Neural Nets

ProgramCrafter, 12 May 2024 17:52 UTC
3 points
1 comment, 1 min read, LW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjr, 12 Feb 2021 7:55 UTC
15 points
0 comments, 27 min read, LW link

It Can’t Be Mesa-Optimizers All The Way Down (Or Else It Can’t Be Long-Term Supercoherence?)

Austin Witte, 31 Mar 2023 7:21 UTC
20 points
5 comments, 4 min read, LW link

Imagine a world where Microsoft employees used Bing

Christopher King, 31 Mar 2023 18:36 UTC
6 points
2 comments, 2 min read, LW link

Does GPT-4 exhibit agency when summarizing articles?

Christopher King, 24 Mar 2023 15:49 UTC
16 points
2 comments, 5 min read, LW link

More experiments in GPT-4 agency: writing memos

Christopher King, 24 Mar 2023 17:51 UTC
5 points
2 comments, 10 min read, LW link

GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2

Christopher King, 31 Mar 2023 17:05 UTC
6 points
4 comments, 4 min read, LW link

GPT-4 is bad at strategic thinking

Christopher King, 27 Mar 2023 15:11 UTC
22 points
8 comments, 1 min read, LW link

No convincing evidence for gradient descent in activation space

Blaine, 12 Apr 2023 4:48 UTC
82 points
9 comments, 20 min read, LW link

Towards a solution to the alignment problem via objective detection and evaluation

Paul Colognese, 12 Apr 2023 15:39 UTC
9 points
7 comments, 12 min read, LW link

2-D Robustness

Vlad Mikulik, 30 Aug 2019 20:27 UTC
85 points
8 comments, 2 min read, LW link

Gradient hacking

evhub, 16 Oct 2019 0:53 UTC
106 points
39 comments, 3 min read, LW link, 2 reviews

[AN #58] Mesa optimization: what it is, and why we should care

Rohin Shah, 24 Jun 2019 16:10 UTC
55 points
10 comments, 8 min read, LW link
(mailchi.mp)

Simple experiments with deceptive alignment

Andreas_Moe, 15 May 2023 17:41 UTC
7 points
0 comments, 4 min read, LW link

Weak arguments against the universal prior being malign

X4vier, 14 Jun 2018 17:11 UTC
50 points
23 comments, 3 min read, LW link

Challenge proposal: smallest possible self-hardening backdoor for RLHF

Christopher King, 29 Jun 2023 16:56 UTC
7 points
0 comments, 2 min read, LW link

Disincentivizing deception in mesa optimizers with Model Tampering

martinkunev, 11 Jul 2023 0:44 UTC
3 points
0 comments, 2 min read, LW link

Runaway Optimizers in Mind Space

silentbob, 16 Jul 2023 14:26 UTC
16 points
0 comments, 12 min read, LW link

[Question] Do mesa-optimizer risk arguments rely on the train-test paradigm?

Ben Cottier, 10 Sep 2020 15:36 UTC
12 points
7 comments, 1 min read, LW link

Mesa-Optimization: Explain it like I’m 10 Edition

brook, 26 Aug 2023 23:04 UTC
20 points
1 comment, 6 min read, LW link

Evolutions Building Evolutions: Layers of Generate and Test

plex, 5 Feb 2021 18:21 UTC
12 points
1 comment, 6 min read, LW link

Gradations of Inner Alignment Obstacles

abramdemski, 20 Apr 2021 22:18 UTC
81 points
22 comments, 9 min read, LW link

Obstacles to gradient hacking

leogao, 5 Sep 2021 22:42 UTC
28 points
11 comments, 4 min read, LW link

Towards Deconfusing Gradient Hacking

leogao, 24 Oct 2021 0:43 UTC
39 points
3 comments, 12 min read, LW link

[Proposal] Method of locating useful subnets in large models

Quintin Pope, 13 Oct 2021 20:52 UTC
9 points
0 comments, 2 min read, LW link

Some real examples of gradient hacking

Oliver Sourbut, 22 Nov 2021 0:11 UTC
15 points
8 comments, 2 min read, LW link

Understanding Gradient Hacking

peterbarnett, 10 Dec 2021 15:58 UTC
41 points
5 comments, 30 min read, LW link

Motivations, Natural Selection, and Curriculum Engineering

Oliver Sourbut, 16 Dec 2021 1:07 UTC
16 points
0 comments, 42 min read, LW link

Alignment Problems All the Way Down

peterbarnett, 22 Jan 2022 0:19 UTC
29 points
7 comments, 11 min read, LW link

[Question] Do mesa-optimization problems correlate with low-slack?

sudo, 4 Feb 2022 21:11 UTC
1 point
1 comment, 1 min read, LW link

Thoughts on Dangerous Learned Optimization

peterbarnett, 19 Feb 2022 10:46 UTC
4 points
2 comments, 4 min read, LW link

Why No *Interesting* Unaligned Singularity?

David Udell, 20 Apr 2022 0:34 UTC
12 points
12 comments, 1 min read, LW link

[ASoT] Consequentialist models as a superset of mesaoptimizers

leogao, 23 Apr 2022 17:57 UTC
38 points
2 comments, 4 min read, LW link