Mesa-Optimization

Tag · Last edit: 19 Mar 2023 20:15 UTC by Diabloto96

Mesa-optimization occurs when a learned model (such as a neural network) is itself an optimizer. In this situation, a base optimizer creates a second optimizer, called a mesa-optimizer. The primary reference work for this concept is Hubinger et al.’s “Risks from Learned Optimization in Advanced Machine Learning Systems”.

Example: Natural selection is an optimization process that optimizes for reproductive fitness. Natural selection produced humans, who are themselves optimizers. Humans are therefore mesa-optimizers of natural selection.

In the context of AI alignment, the concern is that a base optimizer (e.g., a gradient descent process) may produce a learned model that is itself an optimizer, and that has unexpected and undesirable properties. Even if the gradient descent process is in some sense “trying” to do exactly what human developers want, the resultant mesa-optimizer will not typically be trying to do the exact same thing.^[1]
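
The gap between what the base optimizer selects for and what the resulting mesa-optimizer pursues can be sketched with a toy example. This is a minimal illustration under stated assumptions, not a model of real training: the two features (is_exit, is_green), the candidate weight vectors, and all function names are hypothetical, and the "base optimizer" is a simple argmax over two candidates rather than gradient descent. Ties are broken by candidate order, which is an artifact of the sketch.

```python
# Toy illustration only: every name, feature, and number here is hypothetical.

# Each option is described by two binary features: (is_exit, is_green).
TRAIN_OPTIONS = [(1, 1), (0, 0)]   # in training, the exit always happens to be green
DEPLOY_OPTIONS = [(1, 0), (0, 1)]  # after training, that correlation breaks


def base_objective(option):
    """What the developers want: the chosen option is the exit."""
    return option[0]


def make_learned_model(mesa_weights):
    """Build a learned model that is itself an optimizer.

    It searches over the options it is given and picks the one that scores
    highest under its own internal (mesa-)objective.
    """
    def policy(options):
        return max(options, key=lambda o: sum(w * f for w, f in zip(mesa_weights, o)))
    return policy


def base_optimizer(candidates, options):
    """Stand-in for training: keep the candidate whose behaviour scores best
    on the base objective. On a tie, max() keeps the earliest candidate."""
    return max(candidates, key=lambda w: base_objective(make_learned_model(w)(options)))


CANDIDATES = [(0, 1), (1, 0)]  # "seek green" and "seek the exit"

selected = base_optimizer(CANDIDATES, TRAIN_OPTIONS)
learned_model = make_learned_model(selected)
train_score = base_objective(learned_model(TRAIN_OPTIONS))
deploy_score = base_objective(learned_model(DEPLOY_OPTIONS))
print(selected, train_score, deploy_score)
```

Both candidates behave identically on the training options, so the base objective cannot tell them apart. In this sketch the "seek green" model is kept, and it stops reaching the exit once greenness and the exit come apart.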

History

Previously, work on this concept went under the names “inner optimizer” or “optimization daemons”.

Wei Dai brings up a similar idea in an SL4 thread.^[2]

The optimization daemons article on Arbital was probably published in 2016.^[1]

Jessica Taylor wrote two posts about daemons while at MIRI:

“Are daemons a problem for ideal agents?” (2017-02-11)
“Maximally efficient agents will probably have an anti-daemon immune system” (2017-02-23)

External links

Video by Robert Miles

Some posts that reference optimization daemons:

“Cause prioritization for downside-focused value systems”: “Alternatively, perhaps goal preservation becomes more difficult the more capable AI systems become, in which case the future might be controlled by unstable goal functions taking turns over the steering wheel”
“Techniques for optimizing worst-case performance”: “The difficulty of optimizing worst-case performance is one of the most likely reasons that I think prosaic AI alignment might turn out to be impossible (if combined with an unlucky empirical situation).” (the phrase “unlucky empirical situation” links to the optimization daemons page on Arbital)

^[1] “Optimization daemons”. Arbital.
^[2] Wei Dai. ‘“friendly” humans?’ December 31, 2003.

Risks from Learned Optimization: Introduction
evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant · 31 May 2019 23:44 UTC · 185 points · 42 comments · 12 min read · LW link · 3 reviews

Matt Botvinick on the spontaneous emergence of learning algorithms
Adam Scholl · 12 Aug 2020 7:47 UTC · 154 points · 87 comments · 5 min read · LW link

Embedded Agency (full-text version)
Scott Garrabrant and abramdemski · 15 Nov 2018 19:49 UTC · 201 points · 17 comments · 54 min read · LW link

Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Adrià Garriga-alonso, taufeeque, AdamGleave and ChengCheng · 25 Jul 2024 22:00 UTC · 59 points · 8 comments · 2 min read · LW link · (arxiv.org)

Mesa-Search vs Mesa-Control
abramdemski · 18 Aug 2020 18:51 UTC · 55 points · 45 comments · 7 min read · LW link

Conditions for Mesa-Optimization
evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant · 1 Jun 2019 20:52 UTC · 84 points · 48 comments · 12 min read · LW link

Searching for Search
NicholasKees and janus · 28 Nov 2022 15:31 UTC · 94 points · 9 comments · 14 min read · LW link · 1 review

Trying to Make a Treacherous Mesa-Optimizer
MadHatter · 9 Nov 2022 18:07 UTC · 95 points · 14 comments · 4 min read · LW link · (attentionspan.blog)

Why almost every RL agent does learned optimization
Lee Sharkey · 12 Feb 2023 4:58 UTC · 32 points · 3 comments · 5 min read · LW link

Subsystem Alignment
abramdemski and Scott Garrabrant · 6 Nov 2018 16:16 UTC · 99 points · 12 comments · 1 min read · LW link

Is evolutionary influence the mesa objective that we’re interested in?
David Johnston · 3 May 2022 1:18 UTC · 3 points · 2 comments · 5 min read · LW link

Why GPT wants to mesa-optimize & how we might change this
John_Maxwell · 19 Sep 2020 13:48 UTC · 55 points · 33 comments · 9 min read · LW link

Prize for probable problems
paulfchristiano · 8 Mar 2018 16:58 UTC · 60 points · 63 comments · 4 min read · LW link

How much should we worry about mesa-optimization challenges?
sudo · 25 Jul 2022 3:56 UTC · 4 points · 13 comments · 2 min read · LW link

Defining capability and alignment in gradient descent
Edouard Harris · 5 Nov 2020 14:36 UTC · 22 points · 6 comments · 10 min read · LW link

Satisficers want to become maximisers
Stuart_Armstrong · 21 Oct 2011 16:27 UTC · 38 points · 70 comments · 1 min read · LW link

Mesa-optimization for goals defined only within a training environment is dangerous
Rubi J. Hudson · 17 Aug 2022 3:56 UTC · 6 points · 2 comments · 4 min read · LW link

AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger
DanielFilan · 18 Feb 2021 0:03 UTC · 43 points · 10 comments · 87 min read · LW link

Formal Solution to the Inner Alignment Problem
michaelcohen · 18 Feb 2021 14:51 UTC · 49 points · 123 comments · 2 min read · LW link

Does SGD Produce Deceptive Alignment?
Mark Xu · 6 Nov 2020 23:48 UTC · 96 points · 9 comments · 16 min read · LW link

Thoughts on safety in predictive learning
Steven Byrnes · 30 Jun 2021 19:17 UTC · 19 points · 17 comments · 19 min read · LW link

Utility ≠ Reward
Vlad Mikulik · 5 Sep 2019 17:28 UTC · 130 points · 24 comments · 1 min read · LW link · 2 reviews

Garrabrant and Shah on human modeling in AGI
Rob Bensinger · 4 Aug 2021 4:35 UTC · 60 points · 10 comments · 47 min read · LW link

Approaches to gradient hacking
adamShimi · 14 Aug 2021 15:16 UTC · 16 points · 8 comments · 8 min read · LW link

Consequentialism is in the Stars not Ourselves
DragonGod · 24 Apr 2023 0:02 UTC · 7 points · 19 comments · 5 min read · LW link

Mesa-Optimizers vs “Steered Optimizers”
Steven Byrnes · 10 Jul 2020 16:49 UTC · 45 points · 7 comments · 8 min read · LW link

Open question: are minimal circuits daemon-free?
paulfchristiano · 5 May 2018 22:40 UTC · 83 points · 70 comments · 2 min read · LW link · 1 review

Meta learning to gradient hack
Quintin Pope · 1 Oct 2021 19:25 UTC · 55 points · 11 comments · 3 min read · LW link

Modeling Risks From Learned Optimization
Ben Cottier · 12 Oct 2021 20:54 UTC · 45 points · 0 comments · 12 min read · LW link

Mesa-Optimizers via Grokking
orthonormal · 6 Dec 2022 20:05 UTC · 36 points · 4 comments · 6 min read · LW link

[Question] What specific dangers arise when asking GPT-N to write an Alignment Forum post?
Matthew Barnett · 28 Jul 2020 2:56 UTC · 45 points · 14 comments · 1 min read · LW link

Mlyyrczo
lsusr · 26 Dec 2022 7:58 UTC · 41 points · 14 comments · 3 min read · LW link

Feature Selection
Zack_M_Davis · 1 Nov 2021 0:22 UTC · 319 points · 24 comments · 16 min read · LW link · 1 review

Counting arguments provide no evidence for AI doom
Nora Belrose and Quintin Pope · 27 Feb 2024 23:03 UTC · 94 points · 188 comments · 14 min read · LW link

Clarifying mesa-optimization
Marius Hobbhahn and Pierre Peigné · 21 Mar 2023 15:53 UTC · 38 points · 6 comments · 10 min read · LW link

Inner Alignment: Explain like I’m 12 Edition
Rafael Harth · 1 Aug 2020 15:24 UTC · 181 points · 47 comments · 13 min read · LW link · 2 reviews

If I were a well-intentioned AI… IV: Mesa-optimising
Stuart_Armstrong · 2 Mar 2020 12:16 UTC · 26 points · 2 comments · 6 min read · LW link

Anomalous tokens reveal the original identities of Instruct models
9 Feb 2023 1:30 UTC · 139 points · 16 comments · 9 min read · LW link · (generative.ink)

Principled Satisficing To Avoid Goodhart
JenniferRM · 16 Aug 2024 19:05 UTC · 45 points · 2 comments · 8 min read · LW link

Thoughts on gradient hacking
Richard_Ngo · 3 Sep 2021 13:02 UTC · 33 points · 11 comments · 4 min read · LW link

[ASoT] Some thoughts about deceptive mesaoptimization
leogao · 28 Mar 2022 21:14 UTC · 24 points · 5 comments · 7 min read · LW link

[ASoT] Some thoughts about imperfect world modeling
leogao · 7 Apr 2022 15:42 UTC · 7 points · 0 comments · 4 min read · LW link

[Question] Three questions about mesa-optimizers
Eric Neyman · 12 Apr 2022 2:58 UTC · 24 points · 5 comments · 3 min read · LW link

[Question] Why is pseudo-alignment “worse” than other ways ML can fail to generalize?
nostalgebraist · 18 Jul 2020 22:54 UTC · 45 points · 9 comments · 2 min read · LW link

Risks from Learned Optimization: Conclusion and Related Work
evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant · 7 Jun 2019 19:53 UTC · 82 points · 5 comments · 6 min read · LW link

Deceptive Alignment
evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant · 5 Jun 2019 20:16 UTC · 118 points · 20 comments · 17 min read · LW link

The Inner Alignment Problem
evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant · 4 Jun 2019 1:20 UTC · 103 points · 17 comments · 13 min read · LW link

The Speed + Simplicity Prior is probably anti-deceptive
Yonadav Shavit · 27 Apr 2022 19:30 UTC · 28 points · 28 comments · 12 min read · LW link

Agency As a Natural Abstraction
Thane Ruthenis · 13 May 2022 18:02 UTC · 55 points · 9 comments · 13 min read · LW link

A Story of AI Risk: InstructGPT-N
peterbarnett · 26 May 2022 23:22 UTC · 24 points · 0 comments · 8 min read · LW link

Towards Gears-Level Understanding of Agency
Thane Ruthenis · 16 Jun 2022 22:00 UTC · 25 points · 4 comments · 18 min read · LW link

Goal Alignment Is Robust To the Sharp Left Turn
Thane Ruthenis · 13 Jul 2022 20:23 UTC · 42 points · 16 comments · 4 min read · LW link

Convergence Towards World-Models: A Gears-Level Model
Thane Ruthenis · 4 Aug 2022 23:31 UTC · 38 points · 1 comment · 13 min read · LW link

Gradient descent doesn’t select for inner search
Ivan Vendrov · 13 Aug 2022 4:15 UTC · 47 points · 23 comments · 4 min read · LW link

Deception as the optimal: mesa-optimizers and inner alignment
Eleni Angelou · 16 Aug 2022 4:49 UTC · 11 points · 0 comments · 5 min read · LW link

Interpretability Tools Are an Attack Channel
Thane Ruthenis · 17 Aug 2022 18:47 UTC · 42 points · 14 comments · 1 min read · LW link

Broad Picture of Human Values
Thane Ruthenis · 20 Aug 2022 19:42 UTC · 42 points · 6 comments · 10 min read · LW link

Are Generative World Models a Mesa-Optimization Risk?
Thane Ruthenis · 29 Aug 2022 18:37 UTC · 13 points · 2 comments · 3 min read · LW link

Inner alignment: what are we pointing at?
lemonhope · 18 Sep 2022 11:09 UTC · 14 points · 2 comments · 1 min read · LW link

Towards deconfusing wireheading and reward maximization
leogao · 21 Sep 2022 0:36 UTC · 81 points · 7 comments · 4 min read · LW link

Planning capacity and daemons
lemonhope · 26 Sep 2022 0:15 UTC · 2 points · 0 comments · 5 min read · LW link

Greed Is the Root of This Evil
Thane Ruthenis · 13 Oct 2022 20:40 UTC · 18 points · 7 comments · 8 min read · LW link

My (naive) take on Risks from Learned Optimization
Artyom Karpov · 31 Oct 2022 10:59 UTC · 7 points · 0 comments · 5 min read · LW link

Value Formation: An Overarching Model
Thane Ruthenis · 15 Nov 2022 17:16 UTC · 34 points · 20 comments · 34 min read · LW link

Caution when interpreting Deepmind’s In-context RL paper
Sam Marks · 1 Nov 2022 2:42 UTC · 105 points · 8 comments · 4 min read · LW link

The Disastrously Confident And Inaccurate AI
Sharat Jacob Jacob · 18 Nov 2022 19:06 UTC · 13 points · 0 comments · 13 min read · LW link

In Defense of Wrapper-Minds
Thane Ruthenis · 28 Dec 2022 18:28 UTC · 24 points · 38 comments · 3 min read · LW link

Gradient Filtering
Jozdien and janus · 18 Jan 2023 20:09 UTC · 54 points · 16 comments · 13 min read · LW link

Gradient hacking is extremely difficult
beren · 24 Jan 2023 15:45 UTC · 162 points · 22 comments · 5 min read · LW link

Against Boltzmann mesaoptimizers
porby · 30 Jan 2023 2:55 UTC · 76 points · 6 comments · 4 min read · LW link

Medical Image Registration: The obscure field where Deep Mesaoptimizers are already at the top of the benchmarks. (post + colab notebook)
Hastings · 30 Jan 2023 22:46 UTC · 34 points · 1 comment · 3 min read · LW link

Powerful mesa-optimisation is already here
Roman Leventov · 17 Feb 2023 4:59 UTC · 35 points · 1 comment · 2 min read · LW link · (arxiv.org)

Finding Backward Chaining Circuits in Transformers Trained on Tree Search
abhayesian, Jannik Brinkmann and Victor Levoso · 28 May 2024 5:29 UTC · 50 points · 1 comment · 9 min read · LW link · (arxiv.org)

The Inner Alignment Problem
Jakub Halmeš · 24 Feb 2024 17:55 UTC · 1 point · 1 comment · 3 min read · LW link · (jakubhalmes.substack.com)

Understanding mesa-optimization using toy models
tilmanr, rusheb, Guillaume Corlouer, Dan Valentine, afspies, mivanitskiy and Can · 7 May 2023 17:00 UTC · 43 points · 2 comments · 10 min read · LW link

Measuring Learned Optimization in Small Transformer Models
J Bostock · 8 Apr 2024 14:41 UTC · 22 points · 0 comments · 11 min read · LW link

Visualizing neural network planning
Nevan Wichers, Victor Tao, Fazl and Riccardo Volpato · 9 May 2024 6:40 UTC · 4 points · 0 comments · 5 min read · LW link

The Human’s Role in Mesa Optimization
silentbob · 9 May 2024 12:07 UTC · 5 points · 0 comments · 2 min read · LW link

Inner Optimization Mechanisms in Neural Nets
ProgramCrafter · 12 May 2024 17:52 UTC · 3 points · 1 comment · 1 min read · LW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment
jbkjr · 12 Feb 2021 7:55 UTC · 15 points · 0 comments · 27 min read · LW link

It Can’t Be Mesa-Optimizers All The Way Down (Or Else It Can’t Be Long-Term Supercoherence?)
Austin Witte · 31 Mar 2023 7:21 UTC · 20 points · 5 comments · 4 min read · LW link

Imagine a world where Microsoft employees used Bing
Christopher King · 31 Mar 2023 18:36 UTC · 6 points · 2 comments · 2 min read · LW link

Does GPT-4 exhibit agency when summarizing articles?
Christopher King · 24 Mar 2023 15:49 UTC · 16 points · 2 comments · 5 min read · LW link

More experiments in GPT-4 agency: writing memos
Christopher King · 24 Mar 2023 17:51 UTC · 5 points · 2 comments · 10 min read · LW link

GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2
Christopher King · 31 Mar 2023 17:05 UTC · 6 points · 4 comments · 4 min read · LW link

GPT-4 is bad at strategic thinking
Christopher King · 27 Mar 2023 15:11 UTC · 22 points · 8 comments · 1 min read · LW link

No convincing evidence for gradient descent in activation space
Blaine · 12 Apr 2023 4:48 UTC · 82 points · 9 comments · 20 min read · LW link

Towards a solution to the alignment problem via objective detection and evaluation
Paul Colognese · 12 Apr 2023 15:39 UTC · 9 points · 7 comments · 12 min read · LW link

2-D Robustness
Vlad Mikulik · 30 Aug 2019 20:27 UTC · 85 points · 8 comments · 2 min read · LW link

Gradient hacking
evhub · 16 Oct 2019 0:53 UTC · 106 points · 39 comments · 3 min read · LW link · 2 reviews

[AN #58] Mesa optimization: what it is, and why we should care
Rohin Shah · 24 Jun 2019 16:10 UTC · 55 points · 10 comments · 8 min read · LW link · (mailchi.mp)

Simple experiments with deceptive alignment
Andreas_Moe · 15 May 2023 17:41 UTC · 7 points · 0 comments · 4 min read · LW link

Weak arguments against the universal prior being malign
X4vier · 14 Jun 2018 17:11 UTC · 50 points · 23 comments · 3 min read · LW link

Challenge proposal: smallest possible self-hardening backdoor for RLHF
Christopher King · 29 Jun 2023 16:56 UTC · 7 points · 0 comments · 2 min read · LW link

Disincentivizing deception in mesa optimizers with Model Tampering
martinkunev · 11 Jul 2023 0:44 UTC · 3 points · 0 comments · 2 min read · LW link

Runaway Optimizers in Mind Space
silentbob · 16 Jul 2023 14:26 UTC · 16 points · 0 comments · 12 min read · LW link

[Question] Do mesa-optimizer risk arguments rely on the train-test paradigm?
Ben Cottier · 10 Sep 2020 15:36 UTC · 12 points · 7 comments · 1 min read · LW link

Mesa-Optimization: Explain it like I’m 10 Edition
brook · 26 Aug 2023 23:04 UTC · 20 points · 1 comment · 6 min read · LW link

Evolutions Building Evolutions: Layers of Generate and Test
plex · 5 Feb 2021 18:21 UTC · 12 points · 1 comment · 6 min read · LW link

Gradations of Inner Alignment Obstacles
abramdemski · 20 Apr 2021 22:18 UTC · 81 points · 22 comments · 9 min read · LW link

Obstacles to gradient hacking
leogao · 5 Sep 2021 22:42 UTC · 28 points · 11 comments · 4 min read · LW link

Towards Deconfusing Gradient Hacking
leogao · 24 Oct 2021 0:43 UTC · 39 points · 3 comments · 12 min read · LW link

[Proposal] Method of locating useful subnets in large models
Quintin Pope · 13 Oct 2021 20:52 UTC · 9 points · 0 comments · 2 min read · LW link

Some real examples of gradient hacking
Oliver Sourbut · 22 Nov 2021 0:11 UTC · 15 points · 8 comments · 2 min read · LW link

Understanding Gradient Hacking
peterbarnett · 10 Dec 2021 15:58 UTC · 41 points · 5 comments · 30 min read · LW link

Motivations, Natural Selection, and Curriculum Engineering
Oliver Sourbut · 16 Dec 2021 1:07 UTC · 16 points · 0 comments · 42 min read · LW link

Alignment Problems All the Way Down
peterbarnett · 22 Jan 2022 0:19 UTC · 29 points · 7 comments · 11 min read · LW link

[Question] Do mesa-optimization problems correlate with low-slack?
sudo · 4 Feb 2022 21:11 UTC · 1 point · 1 comment · 1 min read · LW link

Thoughts on Dangerous Learned Optimization
peterbarnett · 19 Feb 2022 10:46 UTC · 4 points · 2 comments · 4 min read · LW link

Why No Interesting Unaligned Singularity?
David Udell · 20 Apr 2022 0:34 UTC · 12 points · 12 comments · 1 min read · LW link

[ASoT] Consequentialist models as a superset of mesaoptimizers
leogao · 23 Apr 2022 17:57 UTC · 38 points · 2 comments · 4 min read · LW link
