
AI Control

Last edit: 17 Aug 2024 2:00 UTC by Ben Pace

AI Control, in the context of AI Alignment, is a category of plans that aim to ensure safety and derive benefit from AI systems even if they are goal-directed and actively trying to subvert your control measures. From The case for ensuring that powerful AIs are controlled:

In this post, we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.

and

There are two main lines of defense you could employ to prevent schemers from causing catastrophes.

  • Alignment: Ensure that your models aren’t scheming.[2]

  • Control: Ensure that even if your models are scheming, you’ll be safe, because they are not capable of subverting your safety measures.[3]
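
One concrete protocol in the "control" line of defense is trusted monitoring, studied in AI Control: Improving Safety Despite Intentional Subversion (listed below): a weaker but trusted model assigns a suspicion score to each output of the powerful untrusted model, and a limited human audit budget is spent on the most suspicious outputs. The Python sketch below is only an illustration of that idea; the function names and the fixed-threshold budgeting are assumptions, not code from any of the posts.

```python
# A minimal, illustrative sketch of a trusted-monitoring control protocol.
# All names (untrusted_model, trusted_monitor, human_audit) and the
# threshold/budget scheme are hypothetical placeholders, not code from the
# posts listed on this page.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Decision:
    output: str
    audited: bool
    executed: bool


def trusted_monitoring(
    tasks: List[str],
    untrusted_model: Callable[[str], str],         # powerful, possibly scheming
    trusted_monitor: Callable[[str, str], float],  # weaker but trusted; returns a suspicion score in [0, 1]
    human_audit: Callable[[str, str], bool],       # expensive; returns True if the output looks safe
    audit_budget: int,                             # how many outputs humans can afford to audit
    suspicion_threshold: float = 0.9,
) -> List[Decision]:
    """Execute untrusted outputs, spending the audit budget on the most suspicious ones."""
    decisions: List[Decision] = []
    audits_used = 0
    for task in tasks:
        output = untrusted_model(task)
        score = trusted_monitor(task, output)
        if score >= suspicion_threshold and audits_used < audit_budget:
            audits_used += 1
            safe = human_audit(task, output)
            # Blocked if the auditor judges the output unsafe.
            decisions.append(Decision(output, audited=True, executed=safe))
        else:
            # Below threshold, or audit budget exhausted: execute without review.
            decisions.append(Decision(output, audited=False, executed=True))
    return decisions
```

Much of the research tagged here is about how well schemes like this hold up: how to set thresholds and spend the audit budget, and whether a scheming untrusted model can slip a bad action past the monitor, for example by colluding with another untrusted model used as a monitor.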

AI Control: Improving Safety Despite Intentional Subversion

13 Dec 2023 15:51 UTC
228 points
18 comments, 10 min read, LW link, 1 review

The case for ensuring that powerful AIs are controlled

24 Jan 2024 16:11 UTC
260 points
66 comments, 28 min read, LW link

Catching AIs red-handed

5 Jan 2024 17:43 UTC
104 points
22 comments, 17 min read, LW link

Schelling game evaluations for AI control

Olli Järviniemi, 8 Oct 2024 12:01 UTC
65 points
5 comments, 11 min read, LW link

Critiques of the AI control agenda

Jozdien, 14 Feb 2024 19:25 UTC
47 points
14 comments, 9 min read, LW link

How useful is “AI Control” as a framing on AI X-Risk?

14 Mar 2024 18:06 UTC
70 points
4 comments, 34 min read, LW link

AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt

DanielFilan, 11 Apr 2024 21:30 UTC
69 points
10 comments, 107 min read, LW link

How to prevent collusion when using untrusted models to monitor each other

Buck, 25 Sep 2024 18:58 UTC
81 points
8 comments, 22 min read, LW link

Why imperfect adversarial robustness doesn’t doom AI control

18 Nov 2024 16:05 UTC
61 points
26 comments, 2 min read, LW link

Notes on control evaluations for safety cases

28 Feb 2024 16:15 UTC
48 points
0 comments, 32 min read, LW link

Preventing Language Models from hiding their reasoning

31 Oct 2023 14:34 UTC
113 points
14 comments, 12 min read, LW link

Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming

Buck, 10 Oct 2024 13:36 UTC
100 points
4 comments, 13 min read, LW link

The Queen’s Dilemma: A Paradox of Control

Daniel Murfet, 27 Nov 2024 10:40 UTC
24 points
11 comments, 3 min read, LW link

Trustworthy and untrustworthy models

Olli Järviniemi, 19 Aug 2024 16:27 UTC
46 points
3 comments, 8 min read, LW link

Protocol evaluations: good analogies vs control

Fabien Roger, 19 Feb 2024 18:00 UTC
42 points
10 comments, 11 min read, LW link

A toy evaluation of inference code tampering

Fabien Roger, 9 Dec 2024 17:43 UTC
50 points
0 comments, 9 min read, LW link
(alignment.anthropic.com)

Games for AI Control

11 Jul 2024 18:40 UTC
43 points
0 comments, 5 min read, LW link

Diffusion Guided NLP: better steering, mostly a good thing

Nathan Helm-Burger, 10 Aug 2024 19:49 UTC
13 points
0 comments, 1 min read, LW link
(arxiv.org)

NYU Code Debates Update/Postmortem

David Rein, 24 May 2024 16:08 UTC
27 points
4 comments, 10 min read, LW link

Coup probes: Catching catastrophes with probes trained off-policy

Fabien Roger, 17 Nov 2023 17:58 UTC
85 points
9 comments, 11 min read, LW link, 1 review

Win/continue/lose scenarios and execute/replace/audit protocols

Buck, 15 Nov 2024 15:47 UTC
54 points
2 comments, 7 min read, LW link

Sabotage Evaluations for Frontier Models

18 Oct 2024 22:33 UTC
93 points
55 comments, 6 min read, LW link
(assets.anthropic.com)

Using Dangerous AI, But Safely?

habryka, 16 Nov 2024 4:29 UTC
17 points
2 comments, 43 min read, LW link

A Brief Explanation of AI Control

Aaron_Scher, 22 Oct 2024 7:00 UTC
7 points
1 comment, 6 min read, LW link

Dario Amodei’s “Machines of Loving Grace” sound incredibly dangerous, for Humans

Super AGI, 27 Oct 2024 5:05 UTC
8 points
1 comment, 1 min read, LW link

Toward Safety Cases For AI Scheming

31 Oct 2024 17:20 UTC
60 points
1 comment, 2 min read, LW link

Which AI outputs should humans check for shenanigans, to avoid AI takeover? A simple model

Tom Davidson, 27 Mar 2023 23:36 UTC
16 points
3 comments, 8 min read, LW link

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Henry Cai, 16 Jun 2024 13:01 UTC
7 points
0 comments, 7 min read, LW link
(arxiv.org)

Let’s use AI to harden human defenses against AI manipulation

Tom Davidson, 17 May 2023 23:33 UTC
34 points
7 comments, 24 min read, LW link

Auditing LMs with counterfactual search: a tool for control and ELK

Jacob Pfau, 20 Feb 2024 0:02 UTC
28 points
6 comments, 10 min read, LW link

Measuring whether AIs can statelessly strategize to subvert security measures

19 Dec 2024 21:25 UTC
26 points
0 comments, 11 min read, LW link

How to safely use an optimizer

Simon Fischer, 28 Mar 2024 16:11 UTC
47 points
21 comments, 7 min read, LW link

[Question] Would a scope-insensitive AGI be less likely to incapacitate humanity?

Jim Buhler, 21 Jul 2024 14:15 UTC
2 points
3 comments, 1 min read, LW link

Secret Collusion: Will We Know When to Unplug AI?

16 Sep 2024 16:07 UTC
55 points
7 comments, 31 min read, LW link