Mesa-Optimization

Mesa-Optimization occurs when a learned model (such as a neural network) is itself an optimizer. In this situation, the optimizer that produced the model, called the base optimizer, has created a second optimizer, called the mesa-optimizer. The primary reference work for this concept is Hubinger et al.’s “Risks from Learned Optimization in Advanced Machine Learning Systems”.

Example: Natural selection is an optimization process that optimizes for reproductive fitness. Natural selection produced humans, who are themselves optimizers. Humans are therefore mesa-optimizers of natural selection.

In the context of AI alignment, the concern is that a base optimizer (e.g., a gradient descent process) may produce a learned model that is itself an optimizer, and that has unexpected and undesirable properties. Even if the gradient descent process is in some sense “trying” to do exactly what human developers want, the resultant mesa-optimizer will not typically be trying to do the exact same thing.[1]
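
To make the base/mesa distinction concrete, here is a minimal toy sketch in Python (purely illustrative; every function and variable name is invented here, not taken from the paper). The outer loop plays the role of the base optimizer, while the learned policy's forward pass runs its own inner search over actions:

```python
import numpy as np

# Toy sketch: the *base* optimizer is the hill-climbing loop at the bottom, which
# tunes `params`. The *mesa*-optimizer is `mesa_policy`, whose forward pass is
# itself a search over candidate actions, scored by an internal objective
# encoded in those parameters.

CANDIDATE_ACTIONS = np.linspace(-2.0, 2.0, 81)  # fixed 1-D grid of actions to search over

def internal_objective(params, obs, action):
    # The mesa-objective: whatever scoring rule training happened to instill.
    return -(action - params[0] * obs - params[1]) ** 2

def mesa_policy(params, obs):
    # Inner optimization: pick the candidate action maximizing the internal objective.
    scores = [internal_objective(params, obs, a) for a in CANDIDATE_ACTIONS]
    return CANDIDATE_ACTIONS[int(np.argmax(scores))]

def base_objective(obs, action):
    # What the developers actually want: output an action equal to the observation.
    return -(action - obs) ** 2

# Base optimizer: random-search hill climbing over the policy's parameters.
rng = np.random.default_rng(0)
params, best_score = rng.normal(size=2), -np.inf
for _ in range(500):
    trial = params + 0.1 * rng.normal(size=2)
    obs_batch = rng.normal(size=32)
    score = np.mean([base_objective(o, mesa_policy(trial, o)) for o in obs_batch])
    if score > best_score:
        best_score, params = score, trial
```

Note that the base optimizer only ever selects parameters by their training performance; the internal objective those parameters encode only needs to agree with the base objective on the training distribution, which is why the mesa-optimizer need not be trying to do the same thing as its developers intended.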

History

Previously, work on this concept went under the names Inner Optimizer or Optimization Daemons.

Wei Dai brings up a similar idea in an SL4 thread.[2]

The optimization daemons article on Arbital was probably published in 2016.[1]

Jessica Taylor wrote two posts about daemons while at MIRI:

See also

External links

Video by Robert Miles

Some posts that reference optimization daemons:

  1. ^
  2. ^ Wei Dai. ‘”friendly” humans?’ December 31, 2003.

Risks from Learned Optimization: Introduction
May 31, 2019, 11:44 PM · 187 points · 42 comments · 12 min read · LW link · 3 reviews

Matt Botvinick on the spontaneous emergence of learning algorithms
Adam Scholl · Aug 12, 2020, 7:47 AM · 154 points · 87 comments · 5 min read · LW link

Embedded Agency (full-text version)
Nov 15, 2018, 7:49 PM · 201 points · 17 comments · 54 min read · LW link

Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Jul 25, 2024, 10:00 PM · 59 points · 8 comments · 2 min read · LW link · (arxiv.org)

Mesa-Search vs Mesa-Control
abramdemski · Aug 18, 2020, 6:51 PM · 55 points · 45 comments · 7 min read · LW link

Trying to Make a Treacherous Mesa-Optimizer
MadHatter · Nov 9, 2022, 6:07 PM · 95 points · 14 comments · 4 min read · LW link · (attentionspan.blog)

Searching for Search
Nov 28, 2022, 3:31 PM · 97 points · 9 comments · 14 min read · LW link · 1 review

Conditions for Mesa-Optimization
Jun 1, 2019, 8:52 PM · 84 points · 48 comments · 12 min read · LW link

Why almost every RL agent does learned optimization
Lee Sharkey · Feb 12, 2023, 4:58 AM · 32 points · 3 comments · 5 min read · LW link

Subsystem Alignment
Nov 6, 2018, 4:16 PM · 102 points · 12 comments · 1 min read · LW link

The Speed + Simplicity Prior is probably anti-deceptive
Yonadav Shavit · Apr 27, 2022, 7:30 PM · 30 points · 28 comments · 12 min read · LW link

Is evolutionary influence the mesa objective that we’re interested in?
David Johnston · May 3, 2022, 1:18 AM · 3 points · 2 comments · 5 min read · LW link

Risks from Learned Optimization: Conclusion and Related Work
Jun 7, 2019, 7:53 PM · 82 points · 5 comments · 6 min read · LW link

Deceptive Alignment
Jun 5, 2019, 8:16 PM · 118 points · 20 comments · 17 min read · LW link

Why GPT wants to mesa-optimize & how we might change this
John_Maxwell · Sep 19, 2020, 1:48 PM · 55 points · 33 comments · 9 min read · LW link

How much should we worry about mesa-optimization challenges?
sudo · Jul 25, 2022, 3:56 AM · 4 points · 13 comments · 2 min read · LW link

Prize for probable problems
paulfchristiano · Mar 8, 2018, 4:58 PM · 60 points · 63 comments · 4 min read · LW link

The Inner Alignment Problem
Jun 4, 2019, 1:20 AM · 104 points · 17 comments · 13 min read · LW link

Defining capability and alignment in gradient descent
Edouard Harris · Nov 5, 2020, 2:36 PM · 22 points · 6 comments · 10 min read · LW link

Mesa-optimization for goals defined only within a training environment is dangerous
Rubi J. Hudson · Aug 17, 2022, 3:56 AM · 6 points · 2 comments · 4 min read · LW link

Satisficers want to become maximisers
Stuart_Armstrong · Oct 21, 2011, 4:27 PM · 38 points · 70 comments · 1 min read · LW link

AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger
DanielFilan · Feb 18, 2021, 12:03 AM · 43 points · 10 comments · 87 min read · LW link

Formal Solution to the Inner Alignment Problem
michaelcohen · Feb 18, 2021, 2:51 PM · 49 points · 123 comments · 2 min read · LW link

Does SGD Produce Deceptive Alignment?
Mark Xu · Nov 6, 2020, 11:48 PM · 96 points · 9 comments · 16 min read · LW link

Thoughts on safety in predictive learning
Steven Byrnes · Jun 30, 2021, 7:17 PM · 20 points · 17 comments · 19 min read · LW link

Utility ≠ Reward
Vlad Mikulik · Sep 5, 2019, 5:28 PM · 131 points · 24 comments · 1 min read · LW link · 2 reviews

Garrabrant and Shah on human modeling in AGI
Rob Bensinger · Aug 4, 2021, 4:35 AM · 60 points · 10 comments · 47 min read · LW link

Approaches to gradient hacking
adamShimi · Aug 14, 2021, 3:16 PM · 16 points · 8 comments · 8 min read · LW link

Thoughts on gradient hacking
Richard_Ngo · Sep 3, 2021, 1:02 PM · 33 points · 11 comments · 4 min read · LW link

Mesa-Optimizers via Grokking
orthonormal · Dec 6, 2022, 8:05 PM · 36 points · 4 comments · 6 min read · LW link

Meta learning to gradient hack
Quintin Pope · Oct 1, 2021, 7:25 PM · 55 points · 11 comments · 3 min read · LW link

Mlyyrczo
lsusr · Dec 26, 2022, 7:58 AM · 41 points · 14 comments · 3 min read · LW link

Modeling Risks From Learned Optimization
Ben Cottier · Oct 12, 2021, 8:54 PM · 45 points · 0 comments · 12 min read · LW link

Mesa-Optimizers vs “Steered Optimizers”
Steven Byrnes · Jul 10, 2020, 4:49 PM · 45 points · 7 comments · 8 min read · LW link

Feature Selection
Zack_M_Davis · Nov 1, 2021, 12:22 AM · 322 points · 24 comments · 16 min read · LW link · 1 review

Open question: are minimal circuits daemon-free?
paulfchristiano · May 5, 2018, 10:40 PM · 83 points · 70 comments · 2 min read · LW link · 1 review

[Question] What specific dangers arise when asking GPT-N to write an Alignment Forum post?
Matthew Barnett · Jul 28, 2020, 2:56 AM · 45 points · 14 comments · 1 min read · LW link

Clarifying mesa-optimization
Mar 21, 2023, 3:53 PM · 38 points · 6 comments · 10 min read · LW link

Anomalous tokens reveal the original identities of Instruct models
Feb 9, 2023, 1:30 AM · 139 points · 16 comments · 9 min read · LW link · (generative.ink)

Inner Alignment: Explain like I’m 12 Edition
Rafael Harth · Aug 1, 2020, 3:24 PM · 184 points · 47 comments · 13 min read · LW link · 2 reviews

Principled Satisficing To Avoid Goodhart
JenniferRM · Aug 16, 2024, 7:05 PM · 45 points · 2 comments · 8 min read · LW link

AXRP Episode 38.3 - Erik Jenner on Learned Look-Ahead
DanielFilan · Dec 12, 2024, 5:40 AM · 20 points · 0 comments · 16 min read · LW link

If I were a well-intentioned AI… IV: Mesa-optimising
Stuart_Armstrong · Mar 2, 2020, 12:16 PM · 26 points · 2 comments · 6 min read · LW link

Counting arguments provide no evidence for AI doom
Feb 27, 2024, 11:03 PM · 101 points · 188 comments · 14 min read · LW link

[ASoT] Some thoughts about deceptive mesaoptimization
leogao · Mar 28, 2022, 9:14 PM · 24 points · 5 comments · 7 min read · LW link

Turning up the Heat on Deceptively-Misaligned AI
J Bostock · Jan 7, 2025, 12:13 AM · 19 points · 16 comments · 4 min read · LW link

[ASoT] Some thoughts about imperfect world modeling
leogao · Apr 7, 2022, 3:42 PM · 7 points · 0 comments · 4 min read · LW link

[Question] Three questions about mesa-optimizers
Eric Neyman · Apr 12, 2022, 2:58 AM · 26 points · 5 comments · 3 min read · LW link

[Question] Why is pseudo-alignment “worse” than other ways ML can fail to generalize?
nostalgebraist · Jul 18, 2020, 10:54 PM · 45 points · 9 comments · 2 min read · LW link

Consequentialism is in the Stars not Ourselves
DragonGod · Apr 24, 2023, 12:02 AM · 7 points · 19 comments · 5 min read · LW link

[ASoT] Consequentialist models as a superset of mesaoptimizers
leogao · Apr 23, 2022, 5:57 PM · 38 points · 2 comments · 4 min read · LW link

Agency As a Natural Abstraction
Thane Ruthenis · May 13, 2022, 6:02 PM · 55 points · 9 comments · 13 min read · LW link

A Story of AI Risk: InstructGPT-N
peterbarnett · May 26, 2022, 11:22 PM · 24 points · 0 comments · 8 min read · LW link

Towards Gears-Level Understanding of Agency
Thane Ruthenis · Jun 16, 2022, 10:00 PM · 25 points · 4 comments · 18 min read · LW link

Goal Alignment Is Robust To the Sharp Left Turn
Thane Ruthenis · Jul 13, 2022, 8:23 PM · 43 points · 16 comments · 4 min read · LW link

Convergence Towards World-Models: A Gears-Level Model
Thane Ruthenis · Aug 4, 2022, 11:31 PM · 38 points · 1 comment · 13 min read · LW link

Gradient descent doesn’t select for inner search
Ivan Vendrov · Aug 13, 2022, 4:15 AM · 47 points · 23 comments · 4 min read · LW link

Deception as the optimal: mesa-optimizers and inner alignment
Eleni Angelou · Aug 16, 2022, 4:49 AM · 11 points · 0 comments · 5 min read · LW link

Interpretability Tools Are an Attack Channel
Thane Ruthenis · Aug 17, 2022, 6:47 PM · 42 points · 14 comments · 1 min read · LW link

Broad Picture of Human Values
Thane Ruthenis · Aug 20, 2022, 7:42 PM · 42 points · 6 comments · 10 min read · LW link

Are Generative World Models a Mesa-Optimization Risk?
Thane Ruthenis · Aug 29, 2022, 6:37 PM · 14 points · 2 comments · 3 min read · LW link

Inner alignment: what are we pointing at?
lemonhope · Sep 18, 2022, 11:09 AM · 14 points · 2 comments · 1 min read · LW link

Towards deconfusing wireheading and reward maximization
leogao · Sep 21, 2022, 12:36 AM · 81 points · 7 comments · 4 min read · LW link

Planning capacity and daemons
lemonhope · Sep 26, 2022, 12:15 AM · 2 points · 0 comments · 5 min read · LW link

Greed Is the Root of This Evil
Thane Ruthenis · Oct 13, 2022, 8:40 PM · 21 points · 7 comments · 8 min read · LW link

My (naive) take on Risks from Learned Optimization
Artyom Karpov · Oct 31, 2022, 10:59 AM · 7 points · 0 comments · 5 min read · LW link

Value Formation: An Overarching Model
Thane Ruthenis · Nov 15, 2022, 5:16 PM · 34 points · 20 comments · 34 min read · LW link

Caution when interpreting Deepmind’s In-context RL paper
Sam Marks · Nov 1, 2022, 2:42 AM · 105 points · 8 comments · 4 min read · LW link

The Disastrously Confident And Inaccurate AI
Sharat Jacob Jacob · Nov 18, 2022, 7:06 PM · 13 points · 0 comments · 13 min read · LW link

In Defense of Wrapper-Minds
Thane Ruthenis · Dec 28, 2022, 6:28 PM · 24 points · 38 comments · 3 min read · LW link

Gradient Filtering
Jan 18, 2023, 8:09 PM · 56 points · 16 comments · 13 min read · LW link

Gradient hacking is extremely difficult
beren · Jan 24, 2023, 3:45 PM · 164 points · 22 comments · 5 min read · LW link

Against Boltzmann mesaoptimizers
porby · Jan 30, 2023, 2:55 AM · 77 points · 6 comments · 4 min read · LW link

Medical Image Registration: The obscure field where Deep Mesaoptimizers are already at the top of the benchmarks. (post + colab notebook)
Hastings · Jan 30, 2023, 10:46 PM · 35 points · 1 comment · 3 min read · LW link

Powerful mesa-optimisation is already here
Roman Leventov · Feb 17, 2023, 4:59 AM · 35 points · 1 comment · 2 min read · LW link · (arxiv.org)

Finding Backward Chaining Circuits in Transformers Trained on Tree Search
May 28, 2024, 5:29 AM · 50 points · 1 comment · 9 min read · LW link · (arxiv.org)

The Inner Alignment Problem
Jakub Halmeš · Feb 24, 2024, 5:55 PM · 1 point · 1 comment · 3 min read · LW link · (jakubhalmes.substack.com)

Understanding mesa-optimization using toy models
May 7, 2023, 5:00 PM · 43 points · 2 comments · 10 min read · LW link

Measuring Learned Optimization in Small Transformer Models
J Bostock · Apr 8, 2024, 2:41 PM · 22 points · 0 comments · 11 min read · LW link

Visualizing neural network planning
May 9, 2024, 6:40 AM · 4 points · 0 comments · 5 min read · LW link

The Human’s Role in Mesa Optimization
silentbob · May 9, 2024, 12:07 PM · 5 points · 0 comments · 2 min read · LW link

Inner Optimization Mechanisms in Neural Nets
ProgramCrafter · May 12, 2024, 5:52 PM · 3 points · 1 comment · 1 min read · LW link

Why Recursive Self-Improvement Might Not Be the Existential Risk We Fear
Nassim_A · Nov 24, 2024, 5:17 PM · 1 point · 0 comments · 9 min read · LW link

What are the plans for solving the inner alignment problem?
Leonard Holloway · Jan 17, 2025, 9:45 PM · 12 points · 4 comments · 1 min read · LW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment
jbkjr · Feb 12, 2021, 7:55 AM · 15 points · 0 comments · 27 min read · LW link

It Can’t Be Mesa-Optimizers All The Way Down (Or Else It Can’t Be Long-Term Supercoherence?)
Austin Witte · Mar 31, 2023, 7:21 AM · 20 points · 5 comments · 4 min read · LW link

Imagine a world where Microsoft employees used Bing
Christopher King · Mar 31, 2023, 6:36 PM · 6 points · 2 comments · 2 min read · LW link

Does GPT-4 exhibit agency when summarizing articles?
Christopher King · Mar 24, 2023, 3:49 PM · 16 points · 2 comments · 5 min read · LW link

More experiments in GPT-4 agency: writing memos
Christopher King · Mar 24, 2023, 5:51 PM · 5 points · 2 comments · 10 min read · LW link

GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2
Christopher King · Mar 31, 2023, 5:05 PM · 6 points · 4 comments · 4 min read · LW link

GPT-4 is bad at strategic thinking
Christopher King · Mar 27, 2023, 3:11 PM · 22 points · 8 comments · 1 min read · LW link

No convincing evidence for gradient descent in activation space
Blaine · Apr 12, 2023, 4:48 AM · 85 points · 9 comments · 20 min read · LW link

Towards a solution to the alignment problem via objective detection and evaluation
Paul Colognese · Apr 12, 2023, 3:39 PM · 9 points · 7 comments · 12 min read · LW link

2-D Robustness
Vlad Mikulik · Aug 30, 2019, 8:27 PM · 85 points · 8 comments · 2 min read · LW link

Gradient hacking
evhub · Oct 16, 2019, 12:53 AM · 107 points · 39 comments · 3 min read · LW link · 2 reviews

[AN #58] Mesa optimization: what it is, and why we should care
Rohin Shah · Jun 24, 2019, 4:10 PM · 55 points · 10 comments · 8 min read · LW link · (mailchi.mp)

Simple experiments with deceptive alignment
Andreas_Moe · May 15, 2023, 5:41 PM · 7 points · 0 comments · 4 min read · LW link

Weak arguments against the universal prior being malign
X4vier · Jun 14, 2018, 5:11 PM · 50 points · 23 comments · 3 min read · LW link

Challenge proposal: smallest possible self-hardening backdoor for RLHF
Christopher King · Jun 29, 2023, 4:56 PM · 7 points · 0 comments · 2 min read · LW link

Disincentivizing deception in mesa optimizers with Model Tampering
martinkunev · Jul 11, 2023, 12:44 AM · 3 points · 0 comments · 2 min read · LW link

Runaway Optimizers in Mind Space
silentbob · Jul 16, 2023, 2:26 PM · 16 points · 0 comments · 12 min read · LW link

[Question] Do mesa-optimizer risk arguments rely on the train-test paradigm?
Ben Cottier · Sep 10, 2020, 3:36 PM · 12 points · 7 comments · 1 min read · LW link

Mesa-Optimization: Explain it like I’m 10 Edition
brook · Aug 26, 2023, 11:04 PM · 20 points · 1 comment · 6 min read · LW link

Evolutions Building Evolutions: Layers of Generate and Test
plex · Feb 5, 2021, 6:21 PM · 12 points · 1 comment · 6 min read · LW link

Gradations of Inner Alignment Obstacles
abramdemski · Apr 20, 2021, 10:18 PM · 84 points · 22 comments · 9 min read · LW link

Obstacles to gradient hacking
leogao · Sep 5, 2021, 10:42 PM · 28 points · 11 comments · 4 min read · LW link

Towards Deconfusing Gradient Hacking
leogao · Oct 24, 2021, 12:43 AM · 39 points · 3 comments · 12 min read · LW link

[Proposal] Method of locating useful subnets in large models
Quintin Pope · Oct 13, 2021, 8:52 PM · 9 points · 0 comments · 2 min read · LW link

Some real examples of gradient hacking
Oliver Sourbut · Nov 22, 2021, 12:11 AM · 15 points · 8 comments · 2 min read · LW link

Understanding Gradient Hacking
peterbarnett · Dec 10, 2021, 3:58 PM · 41 points · 5 comments · 30 min read · LW link

Motivations, Natural Selection, and Curriculum Engineering
Oliver Sourbut · Dec 16, 2021, 1:07 AM · 16 points · 0 comments · 42 min read · LW link

Alignment Problems All the Way Down
peterbarnett · Jan 22, 2022, 12:19 AM · 29 points · 7 comments · 11 min read · LW link

[Question] Do mesa-optimization problems correlate with low-slack?
sudo · Feb 4, 2022, 9:11 PM · 1 point · 1 comment · 1 min read · LW link

Thoughts on Dangerous Learned Optimization
peterbarnett · Feb 19, 2022, 10:46 AM · 4 points · 2 comments · 4 min read · LW link

Why No *Interesting* Unaligned Singularity?
David Udell · Apr 20, 2022, 12:34 AM · 12 points · 12 comments · 1 min read · LW link