
Corrigibility


A ‘corrigible’ agent is one that doesn’t interfere with what we would intuitively see as attempts to ‘correct’ the agent, or to ‘correct’ our mistakes in building it, and that permits these ‘corrections’ despite instrumentally convergent incentives to resist them.

More abstractly, a corrigible agent experiences no preference or instrumental pressure to interfere with attempts by its programmers or operators to modify it, impede its operation, or halt its execution; it does not attempt to deceive or manipulate its programmers or operators, especially regarding properties of itself that might otherwise lead them to correct it; and it tries to preserve the corrigibility of the overall system if it self-modifies or constructs subagents.

A stronger form of corrigibility would require the AI to positively cooperate or assist, such that the AI would rebuild the shutdown button if it were destroyed, or experience a positive preference not to self-modify if self-modification could lead to incorrigibility. But this is not part of the primary specification since it’s possible that we would not want the AI trying to actively be helpful in assisting our attempts to shut it down, and would in fact prefer the AI to be passive about this.

Producing good proposals for achieving corrigibility, even in specific respects, is an open problem in AI alignment. Areas of active current research include Utility indifference and Interruptibility.

Achieving total corrigibility everywhere via some single, general mental state in which the AI “knows that it is still under construction” or “believes that the programmers know more than it does about its own goals” is termed ‘the hard problem of corrigibility’.

Difficulties

Deception and manipulation by default

By default, most sets of preferences are such that an agent acting according to those preferences will prefer to retain its current preferences. For example, imagine an agent which is attempting to collect stamps. Altering the agent so that it prefers to collect bottle caps would lead to futures where the agent has fewer stamps, and so allowing this event to occur is dispreferred (under the current, stamp-collecting preferences).
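
As a toy illustration (with made-up numbers, not a real proposal), an agent that scores futures with its current stamp-counting utility function will rank ‘prevent the modification’ above ‘allow the modification’:

```python
# Toy illustration (hypothetical numbers): an agent that evaluates futures
# with its *current* utility function (count of stamps) will disprefer
# futures in which its preferences have been changed to bottle caps.

def stamps_collected(future):
    """Current utility function: the agent only counts stamps."""
    return future["stamps"]

futures = {
    # If the modification is allowed, the future agent collects bottle caps instead.
    "allow_modification":   {"stamps": 0,   "bottle_caps": 100},
    # If the modification is resisted, the agent keeps collecting stamps.
    "prevent_modification": {"stamps": 100, "bottle_caps": 0},
}

best = max(futures, key=lambda name: stamps_collected(futures[name]))
print(best)  # -> "prevent_modification": resisting correction wins under the current utility
```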

More generally, as the analysis of instrumentally convergent strategies suggests, most utility functions give an agent strong incentives to retain its current utility function: imagine an agent constructed so that it acts according to the utility function U, whose operators think they built it to act according to a different utility function U’. If the agent learns this fact, it has incentives either to deceive its programmers (prevent them from noticing that it is acting according to U instead of U’) or to manipulate them (convince them that they actually prefer U to U’, or coerce them into leaving its utility function intact).

A corrigible agent must avoid these default incentives to manipulate and deceive, but specifying some set of preferences that avoids deception/​manipulation incentives remains an open problem.

Trouble with utility function uncertainty

A first attempt at describing a corrigible agent might involve specifying a utility maximizing agent that is uncertain about its utility function. However, while this could allow the agent to make some changes to its preferences as a result of observations, the agent would still be incorrigible when it came time for the programmers to attempt to correct what they see as mistakes in their attempts to formulate how the “correct” utility function should be determined from interaction with the environment.

As an overly simplistic example, imagine an agent attempting to maximize the internal happiness of all humans, but which has uncertainty about what that means. The operators might believe that if the agent does not act as intended, they can simply express their dissatisfaction and cause it to update. However, if the agent is reasoning according to an impoverished hypothesis space of utility functions, then it may behave quite incorrigibly: say it has narrowed down its consideration to two different hypotheses, one being that a certain type of opiate causes humans to experience maximal pleasure, and the other is that a certain type of stimulant causes humans to experience maximal pleasure. If the agent begins administering opiates to humans, and the humans resist, then the agent may “update” and start administering stimulants instead. But the agent would still be incorrigible — it would resist attempts by the programmers to turn it off so that it stops drugging people.
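
A minimal sketch of this failure mode, with made-up hypotheses and numbers: the agent ‘updates’ between drugs as evidence comes in, but under neither hypothesis does allowing shutdown ever score well, so it never becomes the preferred action.

```python
# Minimal sketch with made-up numbers: an agent uncertain between two hypotheses
# about the "true" utility function updates on human resistance, but neither
# hypothesis ever favors letting itself be shut down.

posterior = {"opiates_are_best": 0.6, "stimulants_are_best": 0.4}

# Expected utility of each action under each hypothesis (hypothetical values).
utility = {
    "administer_opiates":    {"opiates_are_best": 10, "stimulants_are_best": 0},
    "administer_stimulants": {"opiates_are_best": 0,  "stimulants_are_best": 10},
    "allow_shutdown":        {"opiates_are_best": 0,  "stimulants_are_best": 0},
}

def best_action(post):
    return max(utility, key=lambda a: sum(post[h] * utility[a][h] for h in post))

print(best_action(posterior))  # administer_opiates

# Humans resist; the agent treats this as evidence against the opiate hypothesis
# and "updates" -- but only within its impoverished hypothesis space.
posterior = {"opiates_are_best": 0.1, "stimulants_are_best": 0.9}
print(best_action(posterior))  # administer_stimulants; allow_shutdown never wins
```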

It does not seem that corrigibility can be trivially solved by specifying agents with uncertainty about their utility function. A corrigible agent must somehow also be able to reason about the fact that the humans themselves might have been confused or incorrect when specifying the process by which the utility function is identified, and so on.

Trouble with penalty terms

A second attempt at describing a corrigible agent might specify a utility function with “penalty terms” for bad behavior. This is unlikely to work for a number of reasons. First, there is the Nearest unblocked strategy problem: if a utility function gives an agent strong incentives to manipulate its operators, then adding a penalty for “manipulation” to the utility function will tend to give the agent strong incentives to cause its operators to do what it would have manipulated them to do, without taking any action that technically triggers the “manipulation” clause. It is likely extremely difficult to specify conditions for “deception” and “manipulation” that actually rule out all undesirable behavior, especially if the agent is smarter than us or growing in capability.
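
A toy sketch of the nearest-unblocked-strategy dynamic, with hypothetical actions and scores: penalizing the one ‘manipulation’ action we thought to specify simply shifts the argmax to a nearly identical action the penalty predicate doesn’t cover.

```python
# Hedged sketch: a penalty term that blocks one narrowly specified "manipulation"
# action just shifts the argmax to the nearest strategy that achieves the same
# effect without triggering the penalty. All actions and scores are hypothetical.

actions = {
    # (base utility to the agent, does it technically match the penalty predicate?)
    "honestly_report_behavior":       (1.0, False),
    "explicitly_lie_to_operators":    (9.0, True),   # the one case we thought to penalize
    "selectively_present_true_facts": (8.9, False),  # same effect, not covered by the predicate
}

PENALTY = 100.0

def score(name):
    base, flagged = actions[name]
    return base - (PENALTY if flagged else 0.0)

print(max(actions, key=score))  # -> selectively_present_true_facts
```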

More generally, it does not seem like a good policy to construct an agent that searches for positive-utility ways to deceive and manipulate the programmers, even if those searches are expected to fail. The goal of corrigibility is not to design agents that want to deceive but can’t. Rather, the goal is to construct agents that have no incentives to deceive or manipulate in the first place: a corrigible agent is one that reasons as if it is incomplete and potentially flawed in dangerous ways.

Open problems

Some open problems in corrigibility are:

Hard problem of corrigibility

On a human, intuitive level, there seems to be a central idea behind corrigibility that is simple to us: understand that you’re flawed, that your meta-processes might also be flawed, and that there’s another cognitive system over there (the programmer) that’s less flawed, so you should let that cognitive system correct you even if that doesn’t seem like the first-order right thing to do. You shouldn’t disassemble that other cognitive system to update your model in a Bayesian fashion on all the information it contains; you shouldn’t model how that other cognitive system might optimally correct you and then carry out the correction yourself; you should just let that other cognitive system modify you, without attempting to manipulate how it modifies you into being a better form of ‘correction’.

Formalizing the hard problem of corrigibility might itself turn out to be hard (hence the name). Preliminary research might examine some obvious ways of modeling A as believing that B has some form of information that A’s preference framework designates as important, and show what these algorithms actually do and how they fail to solve the hard problem of corrigibility.

Utility indifference

Utility indifference is a proposed approach in which a correction term is added to the agent’s utility function so that the agent is exactly indifferent to whether its shutdown button is pressed, and therefore has no incentive either to prevent or to cause the button press.

The current state of the art is that the AI behaves as if there were an absolutely fixed probability of the shutdown button being pressed, and therefore doesn’t try to modify this probability. But the AI will then try to use the shutdown button as an outcome pump. Is there any way to avert this?
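
Here is a deliberately oversimplified sketch (hypothetical plans and numbers) of the outcome-pump worry: if the agent scores plans by its normal utility conditional on the button not being pressed, while treating the press probability as a fixed constant, then a risky plan whose failures reliably cause the button to be pressed looks artificially attractive.

```python
# Toy sketch, hypothetical numbers: treating the shutdown-button press as having
# a fixed probability lets the agent use the button as an outcome filter.
# Suppose operators press the button exactly when the risky plan fails.

plans = {
    # (probability of success, utility if success, utility if failure)
    "safe":  (1.0, 5.0, 0.0),
    "risky": (0.5, 9.0, 0.0),   # failures reliably cause operators to press the button
}

def true_expected_utility(plan):
    p, u_succ, u_fail = plans[plan]
    return p * u_succ + (1 - p) * u_fail

def button_conditioned_score(plan):
    # The agent evaluates its normal utility conditional on "button not pressed",
    # while treating the press probability itself as a fixed constant. Because the
    # risky plan's failures always trigger a press, conditioning screens them off.
    p, u_succ, u_fail = plans[plan]
    return u_succ  # E[U | no press]: only success worlds remain for either plan

print(max(plans, key=true_expected_utility))     # safe  (5.0 vs 4.5)
print(max(plans, key=button_conditioned_score))  # risky (9.0 vs 5.0): button as outcome pump
```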

Percentalization

Doing something in the top 0.1% of all actions (ranked by expected utility), rather than taking the single best action. This is actually a Limited AI paradigm and ought to go there, not under Corrigibility.
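
A minimal sketch of the idea, closely related to quantilization (the function name and parameters here are illustrative, not an established API):

```python
import random

def percentalize(actions, utility, top_fraction=0.001):
    """Sketch: sample uniformly from the top `top_fraction` of actions ranked by
    the given utility function, rather than taking the single argmax."""
    ranked = sorted(actions, key=utility, reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    return random.choice(ranked[:cutoff])

# Example: with 10,000 candidate actions, this picks randomly among the top 10.
print(percentalize(range(10_000), utility=lambda a: -abs(a - 4_200)))
```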

Conservative strategies

Do something that’s as similar as possible to other outcomes and strategies that have been whitelisted. Also actually a Limited AI paradigm.

This seems like something that could be investigated in practice on e.g. a chess program.
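
A rough sketch of what such a conservative policy might look like; the similarity metric, whitelist, and threshold are all stand-ins that would need to be supplied per domain (e.g. a position-distance metric for a chess program):

```python
# Rough sketch of a conservative strategy: among candidate actions that clear a
# utility threshold, prefer the one most similar to previously whitelisted
# actions. `similarity` is a stand-in for whatever domain metric is chosen.

def conservative_choice(candidates, utility, similarity, whitelist, min_utility):
    acceptable = [a for a in candidates if utility(a) >= min_utility]
    if not acceptable:
        return None  # no acceptable conservative action; defer to the operators
    return max(acceptable, key=lambda a: max(similarity(a, w) for w in whitelist))
```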

Low impact measure

(Also really a Limited AI paradigm.)

Figure out a measure of ‘impact’ or ‘side effects’ such that if you tell the AI to paint all cars pink, it just paints all cars pink, and doesn’t transform Jupiter into a computer to figure out how to paint all cars pink, and doesn’t dump toxic runoff from the paint into groundwater; and also doesn’t create utility fog to make it look to people like the cars haven’t been painted pink (in order to minimize this ‘side effect’ of painting the cars pink), and doesn’t let the car-painting machines run wild afterward in order to minimize its own actions on the car-painting machines. Roughly, try to actually formalize the notion of “Just paint the cars pink with a minimum of side effects, dammit.”

It seems likely that this problem could turn out to be FAI-complete, if for example “Cure cancer, but then it’s okay if that causes human research investment into curing cancer to decrease” is only distinguishable by us as an okay side effect because it doesn’t result in expected utility decrease under our own desires.

It still seems like it might be good to, e.g., try to define “low side effect” or “low impact” inside the context of a generic Dynamic Bayes Net, and see if maybe we can find something after all that yields our intuitively desired behavior or helps to get closer to it.
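
As a sketch of the naive starting point (which, per the above, is not adequate on its own): penalize the weighted deviation of world-state features from a ‘do nothing’ counterfactual baseline. The features, weights, and penalty coefficient here are all placeholders.

```python
# Naive impact penalty, as a sketch only: score a plan by task utility minus a
# penalty on how far the resulting world-state features deviate from the
# "do nothing" baseline. The text above explains why this naive version fails
# (it also penalizes intended effects and can motivate hiding side effects).

def impact_penalty(state_after_plan, state_after_noop, weights):
    return sum(
        weights[f] * abs(state_after_plan[f] - state_after_noop[f])
        for f in weights
    )

def low_impact_score(task_utility, state_after_plan, state_after_noop, weights, lam=1.0):
    return task_utility - lam * impact_penalty(state_after_plan, state_after_noop, weights)
```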

Ambiguity identification

When there’s more than one thing the user could have meant, ask the user rather than optimizing the mixture. Even if A is in some sense a ‘simpler’ concept for classifying the data than B, notice if B is also a ‘very plausible’ way to classify the data, and ask the user whether they meant A or B. The goal here is, in the classic ‘tank classifier’ problem where the tanks were photographed under lower light levels than the non-tanks, to have something that asks the user, “Did you mean to detect tanks, or low light, or ‘tanks and low light’, or what?”
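
A minimal sketch of the ask-when-ambiguous rule, with hypothetical hypotheses and an arbitrary plausibility threshold:

```python
# Sketch of ambiguity identification: if a second interpretation of the training
# data is also "very plausible" (here, within some ratio of the best one's
# posterior), ask the user which was meant instead of silently optimizing the best.

def choose_or_ask(posterior, ambiguity_ratio=0.2):
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    (best, p_best), (runner_up, p_runner_up) = ranked[0], ranked[1]
    if p_runner_up >= ambiguity_ratio * p_best:
        return f"ASK USER: did you mean '{best}' or '{runner_up}'?"
    return best

print(choose_or_ask({"tank_present": 0.50, "low_light": 0.45, "tank_and_low_light": 0.05}))
```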

Safe outcome prediction and description

Communicate the AI’s predicted result of some action to the user, without putting the user inside an unshielded argmax of maximally effective communication.

Competence aversion

To build, e.g., a behaviorist genie, we need the AI to not experience an instrumental incentive to get better at modeling minds, nor to refer mind-modeling problems to subagents, etcetera. The general subproblem might be ‘averting the instrumental pressure to become good at modeling a particular aspect of reality’. A toy problem might be an AI that in general wants to get the gold in a Wumpus problem, but doesn’t experience an instrumental pressure to know the state of the upper-right-hand-corner cell in particular.
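
A rough sketch of how the toy problem’s scoring might mask out the designated cell: the exploration bonus (a stand-in for instrumental pressure to learn) is computed per cell, and the masked cell simply contributes nothing, so plans get no credit for learning about it. All structure and numbers here are illustrative.

```python
# Toy sketch: value-of-information-style exploration bonus over grid cells, with
# one designated cell masked out so the agent feels no instrumental pressure to
# learn its contents.

import math

def cell_entropy(p):
    """Entropy (in bits) of a Bernoulli belief about one cell's contents."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def exploration_bonus(beliefs_before, beliefs_after, masked_cells):
    """Reward expected entropy reduction, except for cells we want the agent
    to remain incurious about."""
    return sum(
        cell_entropy(beliefs_before[c]) - cell_entropy(beliefs_after[c])
        for c in beliefs_before
        if c not in masked_cells
    )

def plan_score(gold_value, beliefs_before, beliefs_after, masked_cells, beta=0.1):
    return gold_value + beta * exploration_bonus(beliefs_before, beliefs_after, masked_cells)
```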
