
Corrigibility


A ‘corrigible’ agent is one that doesn’t interfere with what we would intuitively see as attempts to ‘correct’ the agent, or to ‘correct’ our mistakes in building it, and that permits these ‘corrections’ despite instrumentally convergent incentives to resist them.

More abstractly, a corrigible agent experiences no preference or instrumental pressure to interfere with attempts by its programmers or operators to modify it, impede its operation, or halt its execution; it does not attempt to deceive or manipulate its programmers or operators, especially regarding properties of itself that might otherwise lead them to correct it; and it tries to preserve the corrigibility of the overall system if it self-modifies or constructs subagents.

A stronger form of corrigibility would require the AI to positively cooperate or assist, such that the AI would rebuild the shutdown button if it were destroyed, or experience a positive preference not to self-modify if self-modification could lead to incorrigibility. But this is not part of the primary specification since it’s possible that we would not want the AI trying to actively be helpful in assisting our attempts to shut it down, and would in fact prefer the AI to be passive about this.

Producing good proposals for achieving corrigibility, even in specific respects, is an open problem in AI alignment. Areas of active current research include Utility indifference and Interruptibility.

Achieving total corrigibility everywhere via some single, general mental state in which the AI “knows that it is still under construction” or “believes that the programmers know more than it does about its own goals” is termed ‘the hard problem of corrigibility’.

Difficulties

Deception and manipulation by default

By default, most sets of preferences are such that an agent acting according to those preferences will prefer to retain its current preferences. For example, imagine an agent which is attempting to collect stamps. Altering the agent so that it prefers to collect bottle caps would lead to futures where the agent has fewer stamps, and so allowing this event to occur is dispreferred (under the current, stamp-collecting preferences).
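
As a toy illustration (with made-up numbers, not a real proposal), an agent that scores futures with its current stamp-counting utility function will rank ‘prevent the modification’ above ‘allow the modification’:

```python
# Toy illustration (hypothetical numbers): an agent that evaluates futures
# with its *current* utility function (count of stamps) will disprefer
# futures in which its preferences have been changed to bottle caps.

def stamps_collected(future):
    """Current utility function: the agent only counts stamps."""
    return future["stamps"]

futures = {
    # If the modification is allowed, the future agent collects bottle caps instead.
    "allow_modification":   {"stamps": 0,   "bottle_caps": 100},
    # If the modification is resisted, the agent keeps collecting stamps.
    "prevent_modification": {"stamps": 100, "bottle_caps": 0},
}

best = max(futures, key=lambda name: stamps_collected(futures[name]))
print(best)  # -> "prevent_modification": resisting correction wins under the current utility
```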

More generally, as the analysis of instrumentally convergent strategies suggests, most utility functions give an agent strong incentives to retain its current utility function: imagine an agent constructed so that it acts according to the utility function U, whose operators think they built it to act according to a different utility function U’. If the agent learns this fact, it has incentives either to deceive its programmers (prevent them from noticing that it is acting according to U instead of U’) or to manipulate them (convince them that they actually prefer U to U’, or coerce them into leaving its utility function intact).

A corrigible agent must avoid these default incentives to manipulate and deceive, but specifying some set of preferences that avoids deception/​manipulation incentives remains an open problem.

Trouble with utility function uncertainty

A first attempt at describing a corrigible agent might involve specifying a utility maximizing agent that is uncertain about its utility function. However, while this could allow the agent to make some changes to its preferences as a result of observations, the agent would still be incorrigible when it came time for the programmers to attempt to correct what they see as mistakes in their attempts to formulate how the “correct” utility function should be determined from interaction with the environment.

As an overly simplistic example, imagine an agent attempting to maximize the internal happiness of all humans, but which has uncertainty about what that means. The operators might believe that if the agent does not act as intended, they can simply express their dissatisfaction and cause it to update. However, if the agent is reasoning according to an impoverished hypothesis space of utility functions, then it may behave quite incorrigibly: say it has narrowed down its consideration to two different hypotheses, one being that a certain type of opiate causes humans to experience maximal pleasure, and the other is that a certain type of stimulant causes humans to experience maximal pleasure. If the agent begins administering opiates to humans, and the humans resist, then the agent may “update” and start administering stimulants instead. But the agent would still be incorrigible — it would resist attempts by the programmers to turn it off so that it stops drugging people.
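
A minimal sketch of this failure mode, with made-up hypotheses and numbers: the agent ‘updates’ between drugs as evidence comes in, but under neither hypothesis does allowing shutdown ever score well, so it never becomes the preferred action.

```python
# Minimal sketch with made-up numbers: an agent uncertain between two hypotheses
# about the "true" utility function updates on human resistance, but neither
# hypothesis ever favors letting itself be shut down.

posterior = {"opiates_are_best": 0.6, "stimulants_are_best": 0.4}

# Expected utility of each action under each hypothesis (hypothetical values).
utility = {
    "administer_opiates":    {"opiates_are_best": 10, "stimulants_are_best": 0},
    "administer_stimulants": {"opiates_are_best": 0,  "stimulants_are_best": 10},
    "allow_shutdown":        {"opiates_are_best": 0,  "stimulants_are_best": 0},
}

def best_action(post):
    return max(utility, key=lambda a: sum(post[h] * utility[a][h] for h in post))

print(best_action(posterior))  # administer_opiates

# Humans resist; the agent treats this as evidence against the opiate hypothesis
# and "updates" -- but only within its impoverished hypothesis space.
posterior = {"opiates_are_best": 0.1, "stimulants_are_best": 0.9}
print(best_action(posterior))  # administer_stimulants; allow_shutdown never wins
```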

It does not seem that corrigibility can be trivially solved by specifying agents with uncertainty about their utility function. A corrigible agent must somehow also be able to reason about the fact that the humans themselves might have been confused or incorrect when specifying the process by which the utility function is identified, and so on.

Trouble with penalty terms

A second attempt at describing a corrigible agent might specify a utility function with “penalty terms” for bad behavior. This is unlikely to work for a number of reasons. First, there is the Nearest unblocked strategy problem: if a utility function gives an agent strong incentives to manipulate its operators, then adding a penalty for “manipulation” to the utility function will tend to give the agent strong incentives to cause its operators to do what it would have manipulated them to do, without taking any action that technically triggers the “manipulation” clause. It is likely extremely difficult to specify conditions for “deception” and “manipulation” that actually rule out all undesirable behavior, especially if the agent is smarter than us or growing in capability.
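
A toy sketch of the nearest-unblocked-strategy dynamic, with hypothetical actions and scores: penalizing the one ‘manipulation’ action we thought to specify simply shifts the argmax to a nearly identical action the penalty predicate doesn’t cover.

```python
# Hedged sketch: a penalty term that blocks one narrowly specified "manipulation"
# action just shifts the argmax to the nearest strategy that achieves the same
# effect without triggering the penalty. All actions and scores are hypothetical.

actions = {
    # (base utility to the agent, does it technically match the penalty predicate?)
    "honestly_report_behavior":       (1.0, False),
    "explicitly_lie_to_operators":    (9.0, True),   # the one case we thought to penalize
    "selectively_present_true_facts": (8.9, False),  # same effect, not covered by the predicate
}

PENALTY = 100.0

def score(name):
    base, flagged = actions[name]
    return base - (PENALTY if flagged else 0.0)

print(max(actions, key=score))  # -> selectively_present_true_facts
```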

More generally, it does not seem like a good policy to construct an agent that searches for positive-utility ways to deceive and manipulate the programmers, even if those searches are expected to fail. The goal of corrigibility is not to design agents that want to deceive but can’t. Rather, the goal is to construct agents that have no incentives to deceive or manipulate in the first place: a corrigible agent is one that reasons as if it is incomplete and potentially flawed in dangerous ways.

Open problems

Some open problems in corrigibility are:

Hard problem of corrigibility

On a human, intuitive level, there seems to be a central idea behind corrigibility that is simple to us: understand that you’re flawed, that your meta-processes might also be flawed, and that there’s another cognitive system over there (the programmer) that’s less flawed, so you should let that cognitive system correct you even if that doesn’t seem like the first-order right thing to do. You shouldn’t disassemble that other cognitive system to update your model in a Bayesian fashion on all the information it contains; you shouldn’t model how that other cognitive system might optimally correct you and then carry out the correction yourself; you should just let that other cognitive system modify you, without attempting to manipulate how it modifies you into being a better form of ‘correction’.

Formalizing the hard problem of corrigibility might itself turn out to be hard (hence the name). Preliminary research might examine some obvious ways of modeling A as believing that B has some form of information that A’s preference framework designates as important, and show what these algorithms actually do and how they fail to solve the hard problem of corrigibility.

Utility indifference

Utility indifference is a proposed approach in which a correction term is added to the agent’s utility function so that the agent is exactly indifferent to whether its shutdown button is pressed, and therefore has no incentive either to prevent or to cause the button press.

The current state of the art is that the AI behaves as if there were an absolutely fixed probability of the shutdown button being pressed, and therefore doesn’t try to modify this probability. But the AI will then try to use the shutdown button as an outcome pump. Is there any way to avert this?
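
Here is a deliberately oversimplified sketch (hypothetical plans and numbers) of the outcome-pump worry: if the agent scores plans by its normal utility conditional on the button not being pressed, while treating the press probability as a fixed constant, then a risky plan whose failures reliably cause the button to be pressed looks artificially attractive.

```python
# Toy sketch, hypothetical numbers: treating the shutdown-button press as having
# a fixed probability lets the agent use the button as an outcome filter.
# Suppose operators press the button exactly when the risky plan fails.

plans = {
    # (probability of success, utility if success, utility if failure)
    "safe":  (1.0, 5.0, 0.0),
    "risky": (0.5, 9.0, 0.0),   # failures reliably cause operators to press the button
}

def true_expected_utility(plan):
    p, u_succ, u_fail = plans[plan]
    return p * u_succ + (1 - p) * u_fail

def button_conditioned_score(plan):
    # The agent evaluates its normal utility conditional on "button not pressed",
    # while treating the press probability itself as a fixed constant. Because the
    # risky plan's failures always trigger a press, conditioning screens them off.
    p, u_succ, u_fail = plans[plan]
    return u_succ  # E[U | no press]: only success worlds remain for either plan

print(max(plans, key=true_expected_utility))     # safe  (5.0 vs 4.5)
print(max(plans, key=button_conditioned_score))  # risky (9.0 vs 5.0): button as outcome pump
```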

Percentalization

Doing something in the top 0.1% of all actions (ranked by expected utility), rather than taking the single best action. This is actually a Limited AI paradigm and ought to go there, not under Corrigibility.
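
A minimal sketch of the idea, closely related to quantilization (the function name and parameters here are illustrative, not an established API):

```python
import random

def percentalize(actions, utility, top_fraction=0.001):
    """Sketch: sample uniformly from the top `top_fraction` of actions ranked by
    the given utility function, rather than taking the single argmax."""
    ranked = sorted(actions, key=utility, reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    return random.choice(ranked[:cutoff])

# Example: with 10,000 candidate actions, this picks randomly among the top 10.
print(percentalize(range(10_000), utility=lambda a: -abs(a - 4_200)))
```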

Conservative strategies

Do something that’s as similar as possible to other outcomes and strategies that have been whitelisted. Also actually a Limited AI paradigm.

This seems like something that could be investigated in practice on e.g. a chess program.
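
A rough sketch of what such a conservative policy might look like; the similarity metric, whitelist, and threshold are all stand-ins that would need to be supplied per domain (e.g. a position-distance metric for a chess program):

```python
# Rough sketch of a conservative strategy: among candidate actions that clear a
# utility threshold, prefer the one most similar to previously whitelisted
# actions. `similarity` is a stand-in for whatever domain metric is chosen.

def conservative_choice(candidates, utility, similarity, whitelist, min_utility):
    acceptable = [a for a in candidates if utility(a) >= min_utility]
    if not acceptable:
        return None  # no acceptable conservative action; defer to the operators
    return max(acceptable, key=lambda a: max(similarity(a, w) for w in whitelist))
```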

Low impact measure

(Also really a Limited AI paradigm.)

Figure out a measure of ‘impact’ or ‘side effects’ such that if you tell the AI to paint all cars pink, it just paints all cars pink, and doesn’t transform Jupiter into a computer to figure out how to paint all cars pink, and doesn’t dump toxic runoff from the paint into groundwater; and also doesn’t create utility fog to make it look to people like the cars haven’t been painted pink (in order to minimize this ‘side effect’ of painting the cars pink), and doesn’t let the car-painting machines run wild afterward in order to minimize its own actions on the car-painting machines. Roughly, try to actually formalize the notion of “Just paint the cars pink with a minimum of side effects, dammit.”

It seems likely that this problem could turn out to be FAI-complete, if for example “Cure cancer, but then it’s okay if that causes human research investment into curing cancer to decrease” is only distinguishable by us as an okay side effect because it doesn’t result in expected utility decrease under our own desires.

It still seems like it might be good to, e.g., try to define “low side effect” or “low impact” inside the context of a generic Dynamic Bayes Net, and see if maybe we can find something after all that yields our intuitively desired behavior or helps to get closer to it.
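
As a sketch of the naive starting point (which, per the above, is not adequate on its own): penalize the weighted deviation of world-state features from a ‘do nothing’ counterfactual baseline. The features, weights, and penalty coefficient here are all placeholders.

```python
# Naive impact penalty, as a sketch only: score a plan by task utility minus a
# penalty on how far the resulting world-state features deviate from the
# "do nothing" baseline. The text above explains why this naive version fails
# (it also penalizes intended effects and can motivate hiding side effects).

def impact_penalty(state_after_plan, state_after_noop, weights):
    return sum(
        weights[f] * abs(state_after_plan[f] - state_after_noop[f])
        for f in weights
    )

def low_impact_score(task_utility, state_after_plan, state_after_noop, weights, lam=1.0):
    return task_utility - lam * impact_penalty(state_after_plan, state_after_noop, weights)
```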

Ambiguity identification

When there’s more than one thing the user could have meant, ask the user rather than optimizing the mixture. Even if A is in some sense a ‘simpler’ concept for classifying the data than B, notice if B is also a ‘very plausible’ way to classify the data, and ask the user whether they meant A or B. The goal here is, in the classic ‘tank classifier’ problem where the tanks were photographed under lower light levels than the non-tanks, to have something that asks the user, “Did you mean to detect tanks, or low light, or ‘tanks and low light’, or what?”
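
A minimal sketch of the ask-when-ambiguous rule, with hypothetical hypotheses and an arbitrary plausibility threshold:

```python
# Sketch of ambiguity identification: if a second interpretation of the training
# data is also "very plausible" (here, within some ratio of the best one's
# posterior), ask the user which was meant instead of silently optimizing the best.

def choose_or_ask(posterior, ambiguity_ratio=0.2):
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    (best, p_best), (runner_up, p_runner_up) = ranked[0], ranked[1]
    if p_runner_up >= ambiguity_ratio * p_best:
        return f"ASK USER: did you mean '{best}' or '{runner_up}'?"
    return best

print(choose_or_ask({"tank_present": 0.50, "low_light": 0.45, "tank_and_low_light": 0.05}))
```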

Safe outcome prediction and description

Communicate the AI’s predicted result of some action to the user, without putting the user inside an unshielded argmax of maximally effective communication.

Competence aversion

To build, e.g., a behaviorist genie, we need the AI to not experience an instrumental incentive to get better at modeling minds, nor to refer mind-modeling problems to subagents, etcetera. The general subproblem might be ‘averting the instrumental pressure to become good at modeling a particular aspect of reality’. A toy problem might be an AI that in general wants to get the gold in a Wumpus problem, but doesn’t experience an instrumental pressure to know the state of the upper-right-hand-corner cell in particular.
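
A rough sketch of how the toy problem’s scoring might mask out the designated cell: the exploration bonus (a stand-in for instrumental pressure to learn) is computed per cell, and the masked cell simply contributes nothing, so plans get no credit for learning about it. All structure and numbers here are illustrative.

```python
# Toy sketch: value-of-information-style exploration bonus over grid cells, with
# one designated cell masked out so the agent feels no instrumental pressure to
# learn its contents.

import math

def cell_entropy(p):
    """Entropy (in bits) of a Bernoulli belief about one cell's contents."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def exploration_bonus(beliefs_before, beliefs_after, masked_cells):
    """Reward expected entropy reduction, except for cells we want the agent
    to remain incurious about."""
    return sum(
        cell_entropy(beliefs_before[c]) - cell_entropy(beliefs_after[c])
        for c in beliefs_before
        if c not in masked_cells
    )

def plan_score(gold_value, beliefs_before, beliefs_after, masked_cells, beta=0.1):
    return gold_value + beta * exploration_bonus(beliefs_before, beliefs_after, masked_cells)
```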
