
AI Control

Last edit: 17 Aug 2024 2:00 UTC by Ben Pace

AI Control, in the context of AI Alignment, is a category of plans that aim to ensure safety and derive benefit from AI systems even if they are goal-directed and actively trying to subvert your control measures. From The case for ensuring that powerful AIs are controlled:

In this post, we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.

and

There are two main lines of defense you could employ to prevent schemers from causing catastrophes.

  • Alignment: Ensure that your models aren’t scheming.[2]

  • Control: Ensure that even if your models are scheming, you’ll be safe, because they are not capable of subverting your safety measures.[3]
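
One concrete protocol in the "control" line of defense is trusted monitoring, studied in AI Control: Improving Safety Despite Intentional Subversion (listed below): a weaker but trusted model assigns a suspicion score to each output of the powerful untrusted model, and a limited human audit budget is spent on the most suspicious outputs. The Python sketch below is only an illustration of that idea; the function names and the fixed-threshold budgeting are assumptions, not code from any of the posts.

```python
# A minimal, illustrative sketch of a trusted-monitoring control protocol.
# All names (untrusted_model, trusted_monitor, human_audit) and the
# threshold/budget scheme are hypothetical placeholders, not code from the
# posts listed on this page.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Decision:
    output: str
    audited: bool
    executed: bool


def trusted_monitoring(
    tasks: List[str],
    untrusted_model: Callable[[str], str],         # powerful, possibly scheming
    trusted_monitor: Callable[[str, str], float],  # weaker but trusted; returns a suspicion score in [0, 1]
    human_audit: Callable[[str, str], bool],       # expensive; returns True if the output looks safe
    audit_budget: int,                             # how many outputs humans can afford to audit
    suspicion_threshold: float = 0.9,
) -> List[Decision]:
    """Execute untrusted outputs, spending the audit budget on the most suspicious ones."""
    decisions: List[Decision] = []
    audits_used = 0
    for task in tasks:
        output = untrusted_model(task)
        score = trusted_monitor(task, output)
        if score >= suspicion_threshold and audits_used < audit_budget:
            audits_used += 1
            safe = human_audit(task, output)
            # Blocked if the auditor judges the output unsafe.
            decisions.append(Decision(output, audited=True, executed=safe))
        else:
            # Below threshold, or audit budget exhausted: execute without review.
            decisions.append(Decision(output, audited=False, executed=True))
    return decisions
```

Much of the research tagged here is about how well schemes like this hold up: how to set thresholds and spend the audit budget, and whether a scheming untrusted model can slip a bad action past the monitor, for example by colluding with another untrusted model used as a monitor.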

AI Control: Improving Safety Despite Intentional Subversion

13 Dec 2023 15:51 UTC
228 points
18 comments, 10 min read, LW link, 1 review

The case for ensuring that powerful AIs are controlled

24 Jan 2024 16:11 UTC
260 points
66 comments, 28 min read, LW link

Catching AIs red-handed

5 Jan 2024 17:43 UTC
104 points
22 comments, 17 min read, LW link

Schelling game evaluations for AI control

Olli Järviniemi, 8 Oct 2024 12:01 UTC
65 points
5 comments, 11 min read, LW link

Critiques of the AI control agenda

Jozdien, 14 Feb 2024 19:25 UTC
47 points
14 comments, 9 min read, LW link

How useful is “AI Control” as a framing on AI X-Risk?

14 Mar 2024 18:06 UTC
70 points
4 comments, 34 min read, LW link

AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt

DanielFilan, 11 Apr 2024 21:30 UTC
69 points
10 comments, 107 min read, LW link

How to prevent collusion when using untrusted models to monitor each other

Buck, 25 Sep 2024 18:58 UTC
81 points
8 comments, 22 min read, LW link

Why imperfect adversarial robustness doesn’t doom AI control

18 Nov 2024 16:05 UTC
61 points
26 comments, 2 min read, LW link

Notes on control evaluations for safety cases

28 Feb 2024 16:15 UTC
48 points
0 comments, 32 min read, LW link

Preventing Language Models from hiding their reasoning

31 Oct 2023 14:34 UTC
113 points
14 comments, 12 min read, LW link

Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming

Buck, 10 Oct 2024 13:36 UTC
100 points
4 comments, 13 min read, LW link

The Queen’s Dilemma: A Paradox of Control

Daniel Murfet, 27 Nov 2024 10:40 UTC
24 points
11 comments, 3 min read, LW link

Trustworthy and untrustworthy models

Olli Järviniemi, 19 Aug 2024 16:27 UTC
46 points
3 comments, 8 min read, LW link

Protocol evaluations: good analogies vs control

Fabien Roger, 19 Feb 2024 18:00 UTC
42 points
10 comments, 11 min read, LW link

A toy evaluation of inference code tampering

Fabien Roger, 9 Dec 2024 17:43 UTC
50 points
0 comments, 9 min read, LW link
(alignment.anthropic.com)

Games for AI Control

11 Jul 2024 18:40 UTC
43 points
0 comments, 5 min read, LW link

Diffusion Guided NLP: better steering, mostly a good thing

Nathan Helm-Burger, 10 Aug 2024 19:49 UTC
13 points
0 comments, 1 min read, LW link
(arxiv.org)

NYU Code Debates Update/Postmortem

David Rein, 24 May 2024 16:08 UTC
27 points
4 comments, 10 min read, LW link

Coup probes: Catching catastrophes with probes trained off-policy

Fabien Roger, 17 Nov 2023 17:58 UTC
85 points
9 comments, 11 min read, LW link, 1 review

Win/continue/lose scenarios and execute/replace/audit protocols

Buck, 15 Nov 2024 15:47 UTC
54 points
2 comments, 7 min read, LW link

Sabotage Evaluations for Frontier Models

18 Oct 2024 22:33 UTC
93 points
55 comments, 6 min read, LW link
(assets.anthropic.com)

Using Dangerous AI, But Safely?

habryka, 16 Nov 2024 4:29 UTC
17 points
2 comments, 43 min read, LW link

A Brief Explanation of AI Control

Aaron_Scher, 22 Oct 2024 7:00 UTC
7 points
1 comment, 6 min read, LW link

Dario Amodei’s “Machines of Loving Grace” sound incredibly dangerous, for Humans

Super AGI, 27 Oct 2024 5:05 UTC
8 points
1 comment, 1 min read, LW link

Toward Safety Cases For AI Scheming

31 Oct 2024 17:20 UTC
60 points
1 comment, 2 min read, LW link

Which AI outputs should humans check for shenanigans, to avoid AI takeover? A simple model

Tom Davidson, 27 Mar 2023 23:36 UTC
16 points
3 comments, 8 min read, LW link

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Henry Cai, 16 Jun 2024 13:01 UTC
7 points
0 comments, 7 min read, LW link
(arxiv.org)

Let’s use AI to harden human defenses against AI manipulation

Tom Davidson, 17 May 2023 23:33 UTC
34 points
7 comments, 24 min read, LW link

Auditing LMs with counterfactual search: a tool for control and ELK

Jacob Pfau, 20 Feb 2024 0:02 UTC
28 points
6 comments, 10 min read, LW link

Measuring whether AIs can statelessly strategize to subvert security measures

19 Dec 2024 21:25 UTC
26 points
0 comments, 11 min read, LW link

How to safely use an optimizer

Simon Fischer, 28 Mar 2024 16:11 UTC
47 points
21 comments, 7 min read, LW link

[Question] Would a scope-insensitive AGI be less likely to incapacitate humanity?

Jim Buhler, 21 Jul 2024 14:15 UTC
2 points
3 comments, 1 min read, LW link

Secret Collusion: Will We Know When to Unplug AI?

16 Sep 2024 16:07 UTC
55 points
7 comments, 31 min read, LW link