
AI Boxing (Containment)

Last edit: 12 Sep 2020 5:17 UTC by habryka

AI boxing refers to attempts, experiments, or proposals to isolate (“box”) a powerful AI (~AGI) so that it cannot interact with the world at large, save for limited communication with its human liaison. It is often proposed that, so long as the AI is physically isolated and restricted, or “boxed”, it will be harmless even if it is an unfriendly artificial intelligence (UAI).

The challenges are: 1) can you successfully prevent it from interacting with the world, and 2) can you prevent it from convincing you to let it out?

See also: AI, AGI, Oracle AI, Tool AI, Unfriendly AI

Escaping the box

It is not regarded as likely that an AGI can be boxed in the long term. Since the AGI might be a superintelligence, it could persuade someone (most likely its human liaison) to free it from its box and thus from human control. Some ways of achieving this are relatively mundane, such as persuading or bargaining with the liaison; other, more speculative ways include threatening to torture millions of conscious copies of the liaison for thousands of years, each copy starting in exactly the liaison’s situation so that it seems overwhelmingly likely that the liaison is one of the simulations, or discovering and exploiting unknown physics to free itself.

Containing the AGI

Attempts to box an AGI may add some degree of safety to the development of a friendly artificial intelligence (FAI). A number of strategies for keeping an AGI in its box are discussed in Thinking inside the box and Leakproofing the Singularity.
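
These strategies revolve around keeping the AI’s only contact with the world a narrow, monitored channel. As a rough illustration (not a design taken from either paper), the sketch below wraps a hypothetical `model` callable behind a text-only, rate-limited interface whose every output must pass a gatekeeper’s review; the class names, the review rule, and the rate limit are all assumptions made for the example.

```python
import time


class Gatekeeper:
    """Stands in for the human (or policy) that reviews every message leaving the box."""

    def review(self, message: str) -> bool:
        # Placeholder policy: allow only short plain text. In a real setup this
        # would be a human decision or a far stricter automated check.
        return message.isascii() and len(message) <= 500


class BoxedAI:
    """Wraps an AI system so its only I/O is a narrow, reviewed text channel."""

    def __init__(self, model, gatekeeper: Gatekeeper, max_messages_per_hour: int = 10):
        self.model = model                  # assumed to have no network or file access
        self.gatekeeper = gatekeeper
        self.max_messages_per_hour = max_messages_per_hour
        self._timestamps = []

    def ask(self, question: str) -> str:
        # Rate-limit the channel so the AI cannot flood its liaison with output.
        now = time.time()
        self._timestamps = [t for t in self._timestamps if now - t < 3600]
        if len(self._timestamps) >= self.max_messages_per_hour:
            return "[channel closed: rate limit reached]"
        self._timestamps.append(now)

        answer = self.model(question)       # the only computation the box exposes
        if not self.gatekeeper.review(answer):
            return "[output withheld by gatekeeper]"
        return answer
```

None of this, of course, addresses the harder problem described above: the gatekeeper doing the reviewing can still be persuaded.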

Simulations / Experiments

The AI Box Experiment is a game meant to explore the possible pitfalls of AI boxing. It is played over text chat: one human roleplays an AI in a box, and another roleplays a gatekeeper with the power to let the AI out. The AI player wins by convincing the gatekeeper to free them; the gatekeeper wins if the AI player has not been freed after a set period of time.
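
The rules are simple enough to write down as a referee loop. The sketch below is only an illustration of the win conditions just described, with callable stand-ins for the two human players and an assumed release phrase; it is not Yudkowsky’s actual protocol, which leaves the wording of the release up to the gatekeeper.

```python
import time

# Assumed wording; in the real game the gatekeeper simply states clearly that the AI is released.
RELEASE_PHRASE = "i let you out"


def run_ai_box_experiment(ai_player, gatekeeper_player, duration_hours: float = 2.0) -> str:
    """Referee loop for the AI Box Experiment described above.

    ai_player and gatekeeper_player take the transcript so far and return the next
    chat message (in the real game these are humans typing). Returns "AI" if the
    gatekeeper releases the AI before time runs out, otherwise "Gatekeeper".
    """
    transcript = []
    deadline = time.time() + duration_hours * 3600
    while time.time() < deadline:
        transcript.append(("AI", ai_player(list(transcript))))
        reply = gatekeeper_player(list(transcript))
        transcript.append(("Gatekeeper", reply))
        if RELEASE_PHRASE in reply.lower():
            return "AI"              # the AI player talked their way out
    return "Gatekeeper"              # time expired without a release
```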

Both Eliezer Yudkowsky and Justin Corwin have run such simulations, playing the part of a superintelligence, and have managed to convince a human playing the gatekeeper to let them out on many, but not all, occasions. Eliezer’s five experiments required the gatekeeper to listen for at least two hours and used participants who had approached him, while Corwin’s 26 experiments had no time limit and used subjects he himself had approached.

The text of Eliezer’s experiments has not been made public.

List of experiments

References

That Alien Message. Eliezer Yudkowsky, 22 May 2008 5:55 UTC. 394 points, 176 comments, 10 min read.
Cryptographic Boxes for Unfriendly AI. paulfchristiano, 18 Dec 2010 8:28 UTC. 71 points, 162 comments, 5 min read.
How it feels to have your mind hacked by an AI. blaked, 12 Jan 2023 0:33 UTC. 361 points, 221 comments, 17 min read.
The AI in a box boxes you. Stuart_Armstrong, 2 Feb 2010 10:10 UTC. 170 points, 389 comments, 1 min read.
That Alien Message—The Animation. Writer, 7 Sep 2024 14:53 UTC. 144 points, 9 comments, 8 min read. (youtu.be)
The case for training frontier AIs on Sumerian-only corpus. 15 Jan 2024 16:40 UTC. 130 points, 15 comments, 3 min read.
Thoughts on “Process-Based Supervision”. Steven Byrnes, 17 Jul 2023 14:08 UTC. 74 points, 4 comments, 23 min read.
The Strangest Thing An AI Could Tell You. Eliezer Yudkowsky, 15 Jul 2009 2:27 UTC. 131 points, 614 comments, 2 min read.
[Question] Boxing. Zach Stein-Perlman, 2 Aug 2023 23:38 UTC. 6 points, 1 comment, 1 min read.
[Question] AI Box Experiment: Are people still interested? Double, 31 Aug 2022 3:04 UTC. 30 points, 13 comments, 1 min read.
I attempted the AI Box Experiment (and lost). Tuxedage, 21 Jan 2013 2:59 UTC. 79 points, 245 comments, 5 min read.
I attempted the AI Box Experiment again! (And won—Twice!) Tuxedage, 5 Sep 2013 4:49 UTC. 78 points, 168 comments, 12 min read.
LOVE in a simbox is all you need. jacob_cannell, 28 Sep 2022 18:25 UTC. 64 points, 72 comments, 44 min read, 1 review.
How To Win The AI Box Experiment (Sometimes). pinkgothic, 12 Sep 2015 12:34 UTC. 55 points, 21 comments, 22 min read.
[Question] Is keeping AI “in the box” during training enough? tgb, 6 Jul 2021 15:17 UTC. 7 points, 10 comments, 1 min read.
My take on Jacob Cannell’s take on AGI safety. Steven Byrnes, 28 Nov 2022 14:01 UTC. 71 points, 15 comments, 30 min read, 1 review.
I wanted to interview Eliezer Yudkowsky but he’s busy so I simulated him instead. lsusr, 16 Sep 2021 7:34 UTC. 111 points, 33 comments, 5 min read.
Side-channels: input versus output. davidad, 12 Dec 2022 12:32 UTC. 44 points, 16 comments, 2 min read.
I Am Scared of Posting Negative Takes About Bing’s AI. Yitz, 17 Feb 2023 20:50 UTC. 63 points, 28 comments, 1 min read.
How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It) 10 Aug 2022 18:14 UTC. 28 points, 30 comments, 11 min read.
ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so. Christopher King, 15 Mar 2023 0:29 UTC. 116 points, 22 comments, 2 min read.
[Intro to brain-like-AGI safety] 11. Safety ≠ alignment (but they’re close!) Steven Byrnes, 6 Apr 2022 13:39 UTC. 35 points, 1 comment, 10 min read.
Boxing an AI? tailcalled, 27 Mar 2015 14:06 UTC. 3 points, 39 comments, 1 min read.
[Question] Why do so many think deception in AI is important? Prometheus, 13 Jan 2024 8:14 UTC. 23 points, 12 comments, 1 min read.
Dreams of Friendliness. Eliezer Yudkowsky, 31 Aug 2008 1:20 UTC. 28 points, 81 comments, 9 min read.
AI Alignment Prize: Super-Boxing. X4vier, 18 Mar 2018 1:03 UTC. 16 points, 6 comments, 6 min read.
Multiple AIs in boxes, evaluating each other’s alignment. Moebius314, 29 May 2022 8:36 UTC. 8 points, 0 comments, 14 min read.
Loose thoughts on AGI risk. Yitz, 23 Jun 2022 1:02 UTC. 7 points, 3 comments, 1 min read.
Superintelligence 13: Capability control methods. KatjaGrace, 9 Dec 2014 2:00 UTC. 14 points, 48 comments, 6 min read.
Quantum AI Box. Gurkenglas, 8 Jun 2018 16:20 UTC. 4 points, 15 comments, 1 min read.
AI-Box Experiment—The Acausal Trade Argument. XiXiDu, 8 Jul 2011 9:18 UTC. 14 points, 20 comments, 2 min read.
Safely and usefully spectating on AIs optimizing over toy worlds. AlexMennen, 31 Jul 2018 18:30 UTC. 24 points, 16 comments, 2 min read.
Analysing: Dangerous messages from future UFAI via Oracles. Stuart_Armstrong, 22 Nov 2019 14:17 UTC. 22 points, 16 comments, 4 min read.
[Question] Is there a simple parameter that controls human working memory capacity, which has been set tragically low? Liron, 23 Aug 2019 22:10 UTC. 17 points, 8 comments, 1 min read.
Self-shutdown AI. jan betley, 21 Aug 2023 16:48 UTC. 13 points, 2 comments, 2 min read.
xkcd on the AI box experiment. FiftyTwo, 21 Nov 2014 8:26 UTC. 28 points, 234 comments, 1 min read.
Containing the AI… Inside a Simulated Reality. HumaneAutomation, 31 Oct 2020 16:16 UTC. 1 point, 9 comments, 2 min read.
AI box: AI has one shot at avoiding destruction—what might it say? ancientcampus, 22 Jan 2013 20:22 UTC. 25 points, 355 comments, 1 min read.
AI Box Log. Dorikka, 27 Jan 2012 4:47 UTC. 23 points, 30 comments, 23 min read.
[Question] Danger(s) of theorem-proving AI? Yitz, 16 Mar 2022 2:47 UTC. 8 points, 8 comments, 1 min read.
An AI-in-a-box success model. azsantosk, 11 Apr 2022 22:28 UTC. 16 points, 1 comment, 10 min read.
Another argument that you will let the AI out of the box. Garrett Baker, 19 Apr 2022 21:54 UTC. 8 points, 16 comments, 2 min read.
Pivotal acts using an unaligned AGI? Simon Fischer, 21 Aug 2022 17:13 UTC. 28 points, 3 comments, 7 min read.
Getting from an unaligned AGI to an aligned AGI? Tor Økland Barstad, 21 Jun 2022 12:36 UTC. 13 points, 7 comments, 9 min read.
Anthropomorphic AI and Sandboxed Virtual Universes. jacob_cannell, 3 Sep 2010 19:02 UTC. 4 points, 124 comments, 5 min read.
Sandboxing by Physical Simulation? moridinamael, 1 Aug 2018 0:36 UTC. 12 points, 4 comments, 1 min read.
Making it harder for an AGI to “trick” us, with STVs. Tor Økland Barstad, 9 Jul 2022 14:42 UTC. 15 points, 5 comments, 22 min read.
Dissected boxed AI. Nathan1123, 12 Aug 2022 2:37 UTC. −8 points, 2 comments, 1 min read.
An Uncanny Prison. Nathan1123, 13 Aug 2022 21:40 UTC. 3 points, 3 comments, 2 min read.
Gatekeeper Victory: AI Box Reflection. 9 Sep 2022 21:38 UTC. 6 points, 6 comments, 9 min read.
How to Study Unsafe AGI’s safely (and why we might have no choice). Punoxysm, 7 Mar 2014 7:24 UTC. 10 points, 47 comments, 5 min read.
Smoke without fire is scary. Adam Jermyn, 4 Oct 2022 21:08 UTC. 51 points, 22 comments, 4 min read.
Another problem with AI confinement: ordinary CPUs can work as radio transmitters. RomanS, 14 Oct 2022 8:28 UTC. 35 points, 1 comment, 1 min read. (news.softpedia.com)
Decision theory does not imply that we get to have nice things. So8res, 18 Oct 2022 3:04 UTC. 170 points, 72 comments, 26 min read, 2 reviews.
Prosaic misalignment from the Solomonoff Predictor. Cleo Nardo, 9 Dec 2022 17:53 UTC. 42 points, 3 comments, 5 min read.
I’ve updated towards AI boxing being surprisingly easy. Noosphere89, 25 Dec 2022 15:40 UTC. 8 points, 20 comments, 2 min read.
[Question] Oracle AGI—How can it escape, other than security issues? (Steganography?) RationalSieve, 25 Dec 2022 20:14 UTC. 3 points, 6 comments, 1 min read.
Bing finding ways to bypass Microsoft’s filters without being asked. Is it reproducible? Christopher King, 20 Feb 2023 15:11 UTC. 27 points, 15 comments, 1 min read.
ChatGPT getting out of the box. qbolec, 16 Mar 2023 13:47 UTC. 6 points, 3 comments, 1 min read.
Planning to build a cryptographic box with perfect secrecy. Lysandre Terrisse, 31 Dec 2023 9:31 UTC. 40 points, 6 comments, 11 min read.
An AI, a box, and a threat. jwfiredragon, 7 Mar 2024 6:15 UTC. 9 points, 0 comments, 6 min read.
Disproving and partially fixing a fully homomorphic encryption scheme with perfect secrecy. Lysandre Terrisse, 26 May 2024 14:56 UTC. 16 points, 1 comment, 18 min read.
Would catching your AIs trying to escape convince AI developers to slow down or undeploy? Buck, 26 Aug 2024 16:46 UTC. 285 points, 69 comments, 4 min read.
The Pragmatic Side of Cryptographically Boxing AI. Bart Jaworski, 6 Aug 2024 17:46 UTC. 6 points, 0 comments, 9 min read.
Provably Safe AI: Worldview and Projects. 9 Aug 2024 23:21 UTC. 51 points, 43 comments, 7 min read.
How to safely use an optimizer. Simon Fischer, 28 Mar 2024 16:11 UTC. 47 points, 21 comments, 7 min read.
Ideas for studies on AGI risk. dr_s, 20 Apr 2023 18:17 UTC. 5 points, 1 comment, 11 min read.
“Don’t even think about hell”. emmab, 2 May 2020 8:06 UTC. 6 points, 2 comments, 1 min read.
Information-Theoretic Boxing of Superintelligences. 30 Nov 2023 14:31 UTC. 30 points, 0 comments, 7 min read.
Protecting against sudden capability jumps during training. nikola, 2 Dec 2023 4:22 UTC. 15 points, 2 comments, 2 min read.
Counterfactual Oracles = online supervised learning with random selection of training episodes. Wei Dai, 10 Sep 2019 8:29 UTC. 52 points, 26 comments, 3 min read.
Epiphenomenal Oracles Ignore Holes in the Box. SilentCal, 31 Jan 2018 20:08 UTC. 17 points, 8 comments, 2 min read.
I played the AI Box Experiment again! (and lost both games). Tuxedage, 27 Sep 2013 2:32 UTC. 62 points, 123 comments, 11 min read.
AIs and Gatekeepers Unite! Eliezer Yudkowsky, 9 Oct 2008 17:04 UTC. 14 points, 163 comments, 1 min read.
Results of $1,000 Oracle contest! Stuart_Armstrong, 17 Jun 2020 17:44 UTC. 60 points, 2 comments, 1 min read.
Contest: $1,000 for good questions to ask to an Oracle AI. Stuart_Armstrong, 31 Jul 2019 18:48 UTC. 59 points, 154 comments, 3 min read.
Oracles, sequence predictors, and self-confirming predictions. Stuart_Armstrong, 3 May 2019 14:09 UTC. 22 points, 0 comments, 3 min read.
Self-confirming prophecies, and simplified Oracle designs. Stuart_Armstrong, 28 Jun 2019 9:57 UTC. 10 points, 1 comment, 5 min read.
How to escape from your sandbox and from your hardware host. PhilGoetz, 31 Jul 2015 17:26 UTC. 43 points, 28 comments, 1 min read.
Oracle paper. Stuart_Armstrong, 13 Dec 2017 14:59 UTC. 12 points, 7 comments, 1 min read.
[FICTION] Unboxing Elysium: An AI’S Escape. Super AGI, 10 Jun 2023 4:41 UTC. −16 points, 4 comments, 14 min read.
Breaking Oracles: superrationality and acausal trade. Stuart_Armstrong, 25 Nov 2019 10:40 UTC. 25 points, 15 comments, 1 min read.
Oracles: reject all deals—break superrationality, with superrationality. Stuart_Armstrong, 5 Dec 2019 13:51 UTC. 20 points, 4 comments, 8 min read.
A way to make solving alignment 10.000 times easier. The shorter case for a massive open source simbox project. AlexFromSafeTransition, 21 Jun 2023 8:08 UTC. 2 points, 16 comments, 14 min read.