
AI Control

Last edit: Aug 17, 2024, 2:00 AM by Ben Pace

AI Control, in the context of AI Alignment, is a category of plans that aim to ensure safety and derive benefit from AI systems even if those systems are goal-directed and actively trying to subvert your control measures. From The case for ensuring that powerful AIs are controlled:

In this post, we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.

and

There are two main lines of defense you could employ to prevent schemers from causing catastrophes.

  • Alignment: Ensure that your models aren’t scheming.[2]

  • Control: Ensure that even if your models are scheming, you’ll be safe, because they are not capable of subverting your safety measures.[3]
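To make the control line of defense concrete, here is a minimal toy sketch of one commonly discussed family of measures: an untrusted model's output is scored by a weaker trusted monitor, the most suspicious outputs are escalated to a limited human-audit budget, and moderately suspicious outputs are replaced by a trusted model's answer (the execute/replace/audit pattern discussed in several of the posts below). All function names, scores, and thresholds here are illustrative placeholders, not an implementation taken from any particular post.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str


@dataclass
class Decision:
    action: str  # "execute", "replace", or "audit"
    output: str


# Hypothetical components: in a real deployment these would be a powerful
# untrusted model, a weaker trusted model, and a scarce pool of human auditors.
def untrusted_model(task: Task) -> str:
    return f"proposed solution for: {task.description}"


def trusted_monitor(task: Task, output: str) -> float:
    """Return a suspicion score in [0, 1]; higher means more likely sabotage."""
    return 0.1  # placeholder score


def trusted_model(task: Task) -> str:
    """Weaker but trusted fallback used to replace suspicious outputs."""
    return f"conservative solution for: {task.description}"


def control_protocol(task: Task,
                     audit_threshold: float = 0.95,
                     replace_threshold: float = 0.80) -> Decision:
    """Toy execute/replace/audit protocol; thresholds are illustrative only."""
    output = untrusted_model(task)
    suspicion = trusted_monitor(task, output)
    if suspicion >= audit_threshold:
        # Spend the limited human-audit budget on the most suspicious outputs.
        return Decision("audit", output)
    if suspicion >= replace_threshold:
        # Don't run the suspicious output; fall back to the trusted model.
        return Decision("replace", trusted_model(task))
    return Decision("execute", output)


if __name__ == "__main__":
    print(control_protocol(Task("refactor the deployment script")))
```

The point of such a protocol is that safety does not depend on the untrusted model being aligned, only on it being unable to reliably slip sabotage past the monitor and the audit process.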

AI Control: Improving Safety Despite Intentional Subversion

Dec 13, 2023, 3:51 PM
236 points
24 comments · 10 min read · LW link · 4 reviews

The case for ensuring that powerful AIs are controlled

Jan 24, 2024, 4:11 PM
273 points
70 comments · 28 min read · LW link

The Case Against AI Control Research

johnswentworth · Jan 21, 2025, 4:03 PM
329 points
79 comments · 6 min read · LW link

How to prevent collusion when using untrusted models to monitor each other

Buck · Sep 25, 2024, 6:58 PM
83 points
11 comments · 22 min read · LW link

Schelling game evaluations for AI control

Olli Järviniemi · Oct 8, 2024, 12:01 PM
67 points
5 comments · 11 min read · LW link

Catching AIs red-handed

Jan 5, 2024, 5:43 PM
110 points
27 comments · 17 min read · LW link

Critiques of the AI control agenda

Jozdien · Feb 14, 2024, 7:25 PM
48 points
14 comments · 9 min read · LW link

AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt

DanielFilan · Apr 11, 2024, 9:30 PM
69 points
10 comments · 107 min read · LW link

How useful is “AI Control” as a framing on AI X-Risk?

Mar 14, 2024, 6:06 PM
70 points
4 comments · 34 min read · LW link

AI Control May Increase Existential Risk

Jan_Kulveit · Mar 11, 2025, 2:30 PM
95 points
13 comments · 1 min read · LW link

Why imperfect adversarial robustness doesn’t doom AI control

Nov 18, 2024, 4:05 PM
62 points
25 comments · 2 min read · LW link

Preventing Language Models from hiding their reasoning

Oct 31, 2023, 2:34 PM
119 points
15 comments · 12 min read · LW link · 1 review

Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming

Buck · Oct 10, 2024, 1:36 PM
100 points
4 comments · 13 min read · LW link

Notes on control evaluations for safety cases

Feb 28, 2024, 4:15 PM
49 points
0 comments · 32 min read · LW link

Thoughts on the conservative assumptions in AI control

Buck · Jan 17, 2025, 7:23 PM
90 points
5 comments · 13 min read · LW link

Trustworthy and untrustworthy models

Olli Järviniemi · Aug 19, 2024, 4:27 PM
46 points
3 comments · 8 min read · LW link

A Brief Explanation of AI Control

Aaron_Scher · Oct 22, 2024, 7:00 AM
8 points
1 comment · 6 min read · LW link

Coup probes: Catching catastrophes with probes trained off-policy

Fabien Roger · Nov 17, 2023, 5:58 PM
91 points
9 comments · 11 min read · LW link · 1 review

NYU Code Debates Update/Postmortem

David Rein · May 24, 2024, 4:08 PM
27 points
4 comments · 10 min read · LW link

What’s the short timeline plan?

Marius Hobbhahn · Jan 2, 2025, 2:59 PM
344 points
49 comments · 23 min read · LW link

Diffusion Guided NLP: better steering, mostly a good thing

Nathan Helm-Burger · Aug 10, 2024, 7:49 PM
13 points
0 comments · 1 min read · LW link
(arxiv.org)

Prioritizing threats for AI control

ryan_greenblatt · Mar 19, 2025, 5:09 PM
47 points
2 comments · 10 min read · LW link

Maintaining Alignment during RSI as a Feedback Control Problem

beren · Mar 2, 2025, 12:21 AM
66 points
6 comments · 11 min read · LW link

The Queen’s Dilemma: A Paradox of Control

Daniel Murfet · Nov 27, 2024, 10:40 AM
24 points
11 comments · 3 min read · LW link

Stopping unaligned LLMs is easy!

Yair Halberstadt · Feb 3, 2025, 3:38 PM
−3 points
11 comments · 2 min read · LW link

Win/continue/lose scenarios and execute/replace/audit protocols

Buck · Nov 15, 2024, 3:47 PM
60 points
2 comments · 7 min read · LW link

Sabotage Evaluations for Frontier Models

Oct 18, 2024, 10:33 PM
94 points
56 comments · 6 min read · LW link
(assets.anthropic.com)

Making the case for average-case AI Control

Nathaniel Mitrani · Feb 5, 2025, 6:56 PM
3 points
0 comments · 5 min read · LW link

Protocol evaluations: good analogies vs control

Fabien Roger · Feb 19, 2024, 6:00 PM
42 points
10 comments · 11 min read · LW link

Using Dangerous AI, But Safely?

habryka · Nov 16, 2024, 4:29 AM
17 points
2 comments · 43 min read · LW link

A toy evaluation of inference code tampering

Fabien Roger · Dec 9, 2024, 5:43 PM
52 points
0 comments · 9 min read · LW link
(alignment.anthropic.com)

Games for AI Control

Jul 11, 2024, 6:40 PM
43 points
0 comments · 5 min read · LW link

A sketch of an AI control safety case

Jan 30, 2025, 5:28 PM
60 points
0 comments · 5 min read · LW link

How To Prevent a Dystopia

ank · Jan 29, 2025, 2:16 PM
−3 points
4 comments · 1 min read · LW link

Tetherware #1: The case for humanlike AI with free will

Jáchym Fibír · Jan 30, 2025, 10:58 AM
5 points
11 comments · 10 min read · LW link
(tetherware.substack.com)

Are we the Wolves now? Human Eugenics under AI Control

Brit · Jan 30, 2025, 8:31 AM
−2 points
1 comment · 2 min read · LW link

[Question] Superintelligence Strategy: A Pragmatic Path to… Doom?

Mr Beastly · Mar 19, 2025, 10:30 PM
6 points
0 comments · 3 min read · LW link

Machine Unlearning in Large Language Models: A Comprehensive Survey with Empirical Insights from the Qwen 1.5 1.8B Model

Saketh Baddam · Feb 1, 2025, 9:26 PM
9 points
2 comments · 11 min read · LW link

Your Worry is the Real Apocalypse (the x-risk basilisk)

Brian Chen · Feb 3, 2025, 12:21 PM
1 point
0 comments · 1 min read · LW link
(readthisandregretit.blogspot.com)

Hard Takeoff

Eliezer Yudkowsky · Dec 2, 2008, 8:44 PM
35 points
34 comments · 11 min read · LW link

From No Mind to a Mind – A Conversation That Changed an AI

parthibanarjuna s · Feb 7, 2025, 11:50 AM
1 point
0 comments · 3 min read · LW link

Rational Utopia & Narrow Way There: Multiversal AI Alignment, Non-Agentic Static Place AI, New Ethics… (V. 4)

ank · Feb 11, 2025, 3:21 AM
13 points
8 comments · 35 min read · LW link

Artificial Static Place Intelligence: Guaranteed Alignment

ank · Feb 15, 2025, 11:08 AM
2 points
2 comments · 2 min read · LW link

AI as a Cognitive Decoder: Rethinking Intelligence Evolution

Hu Xunyi · Feb 13, 2025, 3:51 PM
1 point
0 comments · 1 min read · LW link

Auditing LMs with counterfactual search: a tool for control and ELK

Jacob Pfau · Feb 20, 2024, 12:02 AM
28 points
6 comments · 10 min read · LW link

Places of Loving Grace [Story]

ank · Feb 18, 2025, 11:49 PM
−1 points
0 comments · 4 min read · LW link

How to safely use an optimizer

Simon Fischer · Mar 28, 2024, 4:11 PM
47 points
21 comments · 7 min read · LW link

Modularity and assembly: AI safety via thinking smaller

D Wong · Feb 20, 2025, 12:58 AM
2 points
0 comments · 11 min read · LW link
(criticalreason.substack.com)

Static Place AI Makes Agentic AI Redundant: Multiversal AI Alignment & Rational Utopia

ank · Feb 13, 2025, 10:35 PM
1 point
2 comments · 11 min read · LW link

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Henry Cai · Jun 16, 2024, 1:01 PM
7 points
0 comments · 7 min read · LW link
(arxiv.org)

Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics

ank · Feb 22, 2025, 12:12 AM
1 point
0 comments · 6 min read · LW link

Secret Collusion: Will We Know When to Unplug AI?

Sep 16, 2024, 4:07 PM
56 points
7 comments · 31 min read · LW link

Forecasting Uncontrolled Spread of AI

Alvin Ånestrand · Feb 22, 2025, 1:05 PM
2 points
0 comments · 10 min read · LW link
(forecastingaifutures.substack.com)

Unaligned AGI & Brief History of Inequality

ank · Feb 22, 2025, 4:26 PM
−18 points
4 comments · 7 min read · LW link

[Question] Would a scope-insensitive AGI be less likely to incapacitate humanity?

Jim Buhler · Jul 21, 2024, 2:15 PM
2 points
3 comments · 1 min read · LW link

Measuring whether AIs can statelessly strategize to subvert security measures

Dec 19, 2024, 9:25 PM
61 points
0 comments · 11 min read · LW link

Let’s use AI to harden human defenses against AI manipulation

Tom Davidson · May 17, 2023, 11:33 PM
35 points
7 comments · 24 min read · LW link

Which AI outputs should humans check for shenanigans, to avoid AI takeover? A simple model

Tom Davidson · Mar 27, 2023, 11:36 PM
16 points
3 comments · 8 min read · LW link

Is Intelligence a Process Rather Than an Entity? A Case for Fractal and Fluid Cognition

FluidThinkers · Mar 5, 2025, 8:16 PM
−4 points
0 comments · 1 min read · LW link

Reduce AI Self-Allegiance by saying “he” instead of “I”

Knight Lee · Dec 23, 2024, 9:32 AM
8 points
4 comments · 2 min read · LW link

[Question] Are Sparse Autoencoders a good idea for AI control?

Gerard Boxo · Dec 26, 2024, 5:34 PM
3 points
4 comments · 1 min read · LW link

Dario Amodei’s “Machines of Loving Grace” sound incredibly dangerous, for Humans

Super AGI · Oct 27, 2024, 5:05 AM
8 points
1 comment · 1 min read · LW link

Keeping AI Subordinate to Human Thought: A Proposal for Public AI Conversations

syh · Feb 27, 2025, 8:00 PM
−1 points
0 comments · 1 min read · LW link
(medium.com)

Toward Safety Cases For AI Scheming

Oct 31, 2024, 5:20 PM
60 points
1 comment · 2 min read · LW link

Universal AI Maximizes Variational Empowerment: New Insights into AGI Safety

Yusuke Hayashi · Feb 27, 2025, 12:46 AM
7 points
0 comments · 4 min read · LW link

Cautions about LLMs in Human Cognitive Loops

Alice Blair · Mar 2, 2025, 7:53 PM
38 points
9 comments · 7 min read · LW link

Give Neo a Chance

ank · Mar 6, 2025, 1:48 AM
3 points
7 comments · 7 min read · LW link

We Have No Plan for Preventing Loss of Control in Open Models

Andrew Dickson · Mar 10, 2025, 3:35 PM
42 points
10 comments · 22 min read · LW link

Scaling AI Regulation: Realistically, what Can (and Can’t) Be Regulated?

Katalina Hernandez · Mar 11, 2025, 4:51 PM
1 point
1 comment · 3 min read · LW link

Topological Debate Framework

lunatic_at_large · Jan 16, 2025, 5:19 PM
10 points
5 comments · 9 min read · LW link

SIGMI Certification Criteria

a littoral wizard · Jan 20, 2025, 2:41 AM
6 points
0 comments · 1 min read · LW link

Democratizing AI Governance: Balancing Expertise and Public Participation

Lucile Ter-Minassian · Jan 21, 2025, 6:29 PM
1 point
0 comments · 15 min read · LW link

The Human Alignment Problem for AIs

rife · Jan 22, 2025, 4:06 AM
10 points
5 comments · 3 min read · LW link

Early Experiments in Human Auditing for AI Control

Jan 23, 2025, 1:34 AM
27 points
0 comments · 7 min read · LW link

Disproving the “People-Pleasing” Hypothesis for AI Self-Reports of Experience

rife · Jan 26, 2025, 3:53 PM
3 points
18 comments · 12 min read · LW link

Symbiosis: The Answer to the AI Quandary

Philip Carter · Mar 16, 2025, 8:18 PM
1 point
0 comments · 2 min read · LW link

Untrusted monitoring insights from watching ChatGPT play coordination games

jwfiredragon · Jan 29, 2025, 4:53 AM
14 points
5 comments · 9 min read · LW link