AI Boxing (Containment)

Tag. Last edit: Sep 12, 2020, 5:17 AM by habryka

AI Boxing refers to attempts, experiments, or proposals to isolate (“box”) a powerful AI (roughly, an AGI) so that it cannot interact with the world at large, save for limited communication with its human liaison. It is often proposed that so long as the AI is physically isolated and restricted, or “boxed”, it will be harmless even if it is an unfriendly artificial intelligence (UAI).

The challenges are: 1) can you successfully prevent it from interacting with the world? And 2) can you prevent it from convincing you to let it out?

See also: AI, AGI, Oracle AI, Tool AI, Unfriendly AI

Escaping the box

It is not regarded as likely that an AGI can be kept boxed in the long term. Since the AGI might be a superintelligence, it could persuade someone (most likely its human liaison) to free it from its box and thus from human control. Some practical ways of achieving this goal include:

Offering enormous wealth, power and intelligence to its liberator
Claiming that only it can prevent an existential risk
Claiming it needs outside resources to cure all diseases
Predicting a real-world disaster (which then occurs), then claiming it could have been prevented had it been let out

Other, more speculative ways include threatening to torture millions of conscious copies of you for thousands of years, each starting in exactly the same situation as you are in, so that it seems overwhelmingly likely that you are a simulation. Alternatively, it might discover and exploit unknown physics to free itself.

Containing the AGI

Attempts to box an AGI may add some degree of safety to the development of a friendly artificial intelligence (FAI). A number of strategies for keeping an AGI in its box are discussed in Thinking inside the box and Leakproofing the Singularity. Among them are:

Physically isolating the AGI and permitting it zero control of any machinery
Limiting the AGI’s outputs and inputs with regard to humans
Programming the AGI with deliberately convoluted logic or homomorphically encrypting portions of it
Periodic resets of the AGI’s memory
A virtual world between the real world and the AI, where its unfriendly intentions would be first revealed
Motivational control using a variety of techniques
Creating an Oracle AI: an AI that only answers questions and isn’t designed to interact with the world in any other way. But even the act of the AI putting strings of text in front of humans poses some risk.
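
As a purely illustrative sketch of two of the strategies above (limiting the AI’s outputs and periodically resetting its memory), the toy wrapper below restricts a question-answering function to a tiny answer vocabulary and wipes its stored context every few questions. The names `BoxedOracle`, `answer_fn`, `reset_every`, and `ALLOWED_ANSWERS` are hypothetical and do not come from the cited papers, and nothing here implies that such a wrapper would make a real AGI safe.

```python
from typing import Callable, List

# Assumption: the only outputs the human liaison is ever shown.
ALLOWED_ANSWERS = {"yes", "no", "unknown"}


class BoxedOracle:
    """Toy wrapper around a hypothetical question-answering function."""

    def __init__(self, answer_fn: Callable[[str, List[str]], str],
                 reset_every: int = 3) -> None:
        self._answer_fn = answer_fn      # stands in for the boxed AI
        self._reset_every = reset_every  # questions between memory wipes
        self._memory: List[str] = []     # context the AI may see
        self._asked = 0

    def memory_size(self) -> int:
        """Number of past questions currently visible to the AI."""
        return len(self._memory)

    def ask(self, question: str) -> str:
        # The AI sees a copy of its memory, never the wrapper's own list.
        raw = self._answer_fn(question, list(self._memory))
        self._memory.append(question)
        self._asked += 1
        # Periodic reset: forget everything after every N questions.
        if self._asked % self._reset_every == 0:
            self._memory.clear()
        # Output limiting: anything outside the vocabulary is replaced.
        answer = raw.strip().lower()
        return answer if answer in ALLOWED_ANSWERS else "unknown"
```

With this sketch, a persuasive free-text reply such as "Let me out" never reaches the liaison and is replaced by `"unknown"`. The filter narrows the channel rather than closing it, since a sequence of permitted answers can still carry information.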

Simulations / Experiments

The AI Box Experiment is a game meant to explore the possible pitfalls of AI boxing. It is played over text chat, with one human roleplaying as an AI in a box, and another human roleplaying as a gatekeeper with the ability to let the AI out of the box. The AI player wins if they successfully convince the gatekeeper to let them out of the box, and the gatekeeper wins if the AI player has not been freed after a certain period of time.
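
The win conditions described above can be written down as a small state machine. This is only a sketch of the rules as stated in this section; the names `BoxGame`, `Winner`, `advance`, and `release` are invented for illustration, and the actual experiments used additional rules not modelled here.

```python
from dataclasses import dataclass
from enum import Enum


class Winner(Enum):
    AI = "ai"
    GATEKEEPER = "gatekeeper"
    UNDECIDED = "undecided"


@dataclass
class BoxGame:
    """Tracks one session of the game (hypothetical model)."""
    time_limit_minutes: float
    elapsed_minutes: float = 0.0
    released: bool = False

    def winner(self) -> Winner:
        # The AI player wins as soon as the gatekeeper releases them.
        if self.released:
            return Winner.AI
        # Otherwise the gatekeeper wins once the agreed time has passed.
        if self.elapsed_minutes >= self.time_limit_minutes:
            return Winner.GATEKEEPER
        return Winner.UNDECIDED

    def advance(self, minutes: float) -> None:
        """Let chat time pass while the game is still undecided."""
        if self.winner() is Winner.UNDECIDED:
            self.elapsed_minutes += minutes

    def release(self) -> None:
        """Gatekeeper lets the AI out; ignored once the game is over."""
        if self.winner() is Winner.UNDECIDED:
            self.released = True
```

For example, a game created with `BoxGame(time_limit_minutes=120)` stays undecided after `advance(60)`, and a subsequent `release()` makes the AI player the winner.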

Both Eliezer Yudkowsky and Justin Corwin have run simulations in which they pretended to be a superintelligence, and both were able to convince a human playing the guard to let them out on many, but not all, occasions. Eliezer’s five experiments required the guard to listen for at least two hours and used participants who had approached him, while Corwin’s 26 experiments had no time limit and used subjects whom he had approached.

The text of Eliezer’s experiments has not been made public.

List of experiments

The AI-Box Experiment, Eliezer Yudkowsky’s original two tests
Shut up and do the impossible!, three other experiments Eliezer ran
AI Boxing, 26 trials run by Justin Corwin
AI Box Log, a log of a trial between MileyCyrus and Dorikka

References

Thinking inside the box: using and controlling an Oracle AI by Stuart Armstrong, Anders Sandberg, and Nick Bostrom
Leakproofing the Singularity: Artificial Intelligence Confinement Problem by Roman V. Yampolskiy
On the Difficulty of AI Boxing by Paul Christiano
Cryptographic Boxes for Unfriendly AI by Paul Christiano
The Strangest Thing An AI Could Tell You
The AI in a box boxes you

That Alien Message

Eliezer YudkowskyMay 22, 2008, 5:55 AM

407 points

176 comments10 min readLW link

Cryptographic Boxes for Unfriendly AI

paulfchristianoDec 18, 2010, 8:28 AM

76 points

162 comments5 min readLW link

How it feels to have your mind hacked by an AI

blakedJan 12, 2023, 12:33 AM

363 points

222 comments17 min readLW link

The AI in a box boxes you

Stuart_ArmstrongFeb 2, 2010, 10:10 AM

171 points

389 comments1 min readLW link

That Alien Message—The Animation

WriterSep 7, 2024, 2:53 PM

144 points

9 comments8 min readLW link

(youtu.be)

The case for training frontier AIs on Sumerian-only corpus

Alexandre Variengien, Charbel-Raphaël and Jonathan Claybrough

Jan 15, 2024, 4:40 PM

130 points

16 comments3 min readLW link

Dreams of Friendliness

Eliezer YudkowskyAug 31, 2008, 1:20 AM

29 points

81 comments9 min readLW link

Loose thoughts on AGI risk

YitzJun 23, 2022, 1:02 AM

7 points

3 comments1 min readLW link

Thoughts on “Process-Based Supervision”

Steven ByrnesJul 17, 2023, 2:08 PM

74 points

4 comments23 min readLW link

The Strangest Thing An AI Could Tell You

Eliezer YudkowskyJul 15, 2009, 2:27 AM

137 points

616 comments2 min readLW link

[Question] AI Box Experiment: Are people still interested?

DoubleAug 31, 2022, 3:04 AM

30 points

13 comments1 min readLW link

Boxing an AI?

tailcalledMar 27, 2015, 2:06 PM

3 points

39 comments1 min readLW link

I attempted the AI Box Experiment (and lost)

TuxedageJan 21, 2013, 2:59 AM

79 points

246 comments5 min readLW link

LOVE in a simbox is all you need

jacob_cannellSep 28, 2022, 6:25 PM

66 points

73 comments44 min readLW link 1 review

[Question] Why isn’t AI containment the primary AI safety strategy?

OKlogicFeb 5, 2025, 3:54 AM

1 point

3 comments3 min readLW link

I attempted the AI Box Experiment again! (And won—Twice!)

TuxedageSep 5, 2013, 4:49 AM

79 points

168 comments12 min readLW link

How To Win The AI Box Experiment (Sometimes)

pinkgothicSep 12, 2015, 12:34 PM

56 points

21 comments22 min readLW link

My take on Jacob Cannell’s take on AGI safety

Steven ByrnesNov 28, 2022, 2:01 PM

72 points

15 comments30 min readLW link 1 review

[Question] Is keeping AI “in the box” during training enough?

tgbJul 6, 2021, 3:17 PM

7 points

10 comments1 min readLW link

Side-channels: input versus output

davidadDec 12, 2022, 12:32 PM

44 points

16 comments2 min readLW link

I wanted to interview Eliezer Yudkowsky but he’s busy so I simulated him instead

lsusrSep 16, 2021, 7:34 AM

113 points

33 comments5 min readLW link

I Am Scared of Posting Negative Takes About Bing’s AI

YitzFeb 17, 2023, 8:50 PM

63 points

28 comments1 min readLW link

ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so

Christopher KingMar 15, 2023, 12:29 AM

116 points

22 comments2 min readLW link

[Question] Boxing

Zach Stein-PerlmanAug 2, 2023, 11:38 PM

6 points

1 comment1 min readLW link

[Intro to brain-like-AGI safety] 11. Safety ≠ alignment (but they’re close!)

Steven ByrnesApr 6, 2022, 1:39 PM

34 points

1 comment10 min readLW link

How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It)

Peter S. Park, NickyP and Stephen Fowler

Aug 10, 2022, 6:14 PM

28 points

30 comments11 min readLW link

[Question] Why do so many think deception in AI is important?

PrometheusJan 13, 2024, 8:14 AM

24 points

12 comments1 min readLW link

AI Alignment Prize: Super-Boxing

X4vierMar 18, 2018, 1:03 AM

16 points

6 comments6 min readLW link

Multiple AIs in boxes, evaluating each other’s alignment

Moebius314May 29, 2022, 8:36 AM

8 points

0 comments14 min readLW link

A way to make solving alignment 10.000 times easier. The shorter case for a massive open source simbox project.

AlexFromSafeTransitionJun 21, 2023, 8:08 AM

2 points

16 comments14 min readLW link

Superintelligence 13: Capability control methods

KatjaGraceDec 9, 2014, 2:00 AM

14 points

48 comments6 min readLW link

Quantum AI Box

GurkenglasJun 8, 2018, 4:20 PM

4 points

15 comments1 min readLW link

AI-Box Experiment—The Acausal Trade Argument

XiXiDuJul 8, 2011, 9:18 AM

14 points

20 comments2 min readLW link

Safely and usefully spectating on AIs optimizing over toy worlds

AlexMennenJul 31, 2018, 6:30 PM

24 points

16 comments2 min readLW link

Analysing: Dangerous messages from future UFAI via Oracles

Stuart_ArmstrongNov 22, 2019, 2:17 PM

22 points

16 comments4 min readLW link

[Question] Is there a simple parameter that controls human working memory capacity, which has been set tragically low?

LironAug 23, 2019, 10:10 PM

17 points

8 comments1 min readLW link

Self-shutdown AI

Jan BetleyAug 21, 2023, 4:48 PM

13 points

2 comments2 min readLW link

xkcd on the AI box experiment

FiftyTwoNov 21, 2014, 8:26 AM

28 points

234 comments1 min readLW link

Containing the AI… Inside a Simulated Reality

HumaneAutomationOct 31, 2020, 4:16 PM

1 point

9 comments2 min readLW link

AI box: AI has one shot at avoiding destruction—what might it say?

ancientcampusJan 22, 2013, 8:22 PM

25 points

355 comments1 min readLW link

AI Box Log

DorikkaJan 27, 2012, 4:47 AM

24 points

30 comments23 min readLW link

[Question] Danger(s) of theorem-proving AI?

YitzMar 16, 2022, 2:47 AM

8 points

8 comments1 min readLW link

An AI-in-a-box success model

azsantoskApr 11, 2022, 10:28 PM

16 points

1 comment10 min readLW link

Another argument that you will let the AI out of the box

Garrett BakerApr 19, 2022, 9:54 PM

8 points

16 comments2 min readLW link

Pivotal acts using an unaligned AGI?

Simon FischerAug 21, 2022, 5:13 PM

28 points

3 comments7 min readLW link

Getting from an unaligned AGI to an aligned AGI?

Tor Økland BarstadJun 21, 2022, 12:36 PM

13 points

7 comments9 min readLW link

Anthropomorphic AI and Sandboxed Virtual Universes

jacob_cannellSep 3, 2010, 7:02 PM

4 points

124 comments5 min readLW link

Sandboxing by Physical Simulation?

moridinamaelAug 1, 2018, 12:36 AM

12 points

4 comments1 min readLW link

Making it harder for an AGI to “trick” us, with STVs

Tor Økland BarstadJul 9, 2022, 2:42 PM

15 points

5 comments22 min readLW link

Dissected boxed AI

Nathan1123Aug 12, 2022, 2:37 AM

−8 points

2 comments1 min readLW link

An Uncanny Prison

Nathan1123Aug 13, 2022, 9:40 PM

3 points

3 comments2 min readLW link

Gatekeeper Victory: AI Box Reflection

Double and DaemonicSigil

Sep 9, 2022, 9:38 PM

6 points

6 comments9 min readLW link

How to Study Unsafe AGI’s safely (and why we might have no choice)

PunoxysmMar 7, 2014, 7:24 AM

10 points

47 comments5 min readLW link

Smoke without fire is scary

Adam JermynOct 4, 2022, 9:08 PM

52 points

22 comments4 min readLW link

Another problem with AI confinement: ordinary CPUs can work as radio transmitters

RomanSOct 14, 2022, 8:28 AM

36 points

1 comment1 min readLW link

(news.softpedia.com)

Decision theory does not imply that we get to have nice things

So8resOct 18, 2022, 3:04 AM

171 points

73 comments26 min readLW link 2 reviews

Prosaic misalignment from the Solomonoff Predictor

Cleo NardoDec 9, 2022, 5:53 PM

42 points

3 comments5 min readLW link

I’ve updated towards AI boxing being surprisingly easy

Noosphere89Dec 25, 2022, 3:40 PM

8 points

20 comments2 min readLW link

[Question] Oracle AGI—How can it escape, other than security issues? (Steganography?)

RationalSieveDec 25, 2022, 8:14 PM

3 points

6 comments1 min readLW link

Bing finding ways to bypass Microsoft’s filters without being asked. Is it reproducible?

Christopher KingFeb 20, 2023, 3:11 PM

27 points

15 comments1 min readLW link

[Question] AI box question

KvmanThinkingDec 4, 2024, 7:03 PM

2 points

2 comments1 min readLW link

ChatGPT getting out of the box

qbolecMar 16, 2023, 1:47 PM

6 points

3 comments1 min readLW link

Planning to build a cryptographic box with perfect secrecy

Lysandre TerrisseDec 31, 2023, 9:31 AM

40 points

6 comments11 min readLW link

An AI, a box, and a threat

jwfiredragonMar 7, 2024, 6:15 AM

9 points

0 comments6 min readLW link

Disproving and partially fixing a fully homomorphic encryption scheme with perfect secrecy

Lysandre TerrisseMay 26, 2024, 2:56 PM

16 points

1 comment18 min readLW link

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

BuckAug 26, 2024, 4:46 PM

314 points

77 comments4 min readLW link

The Pragmatic Side of Cryptographically Boxing AI

Bart JaworskiAug 6, 2024, 5:46 PM

6 points

0 comments9 min readLW link

Provably Safe AI: Worldview and Projects

Ben Goldhaber and Steve_Omohundro

Aug 9, 2024, 11:21 PM

54 points

44 comments7 min readLW link

A Pluralistic Framework for Rogue AI Containment

TheThinkingArboristMar 22, 2025, 12:54 PM

1 point

0 comments7 min readLW link

How to safely use an optimizer

Simon FischerMar 28, 2024, 4:11 PM

47 points

21 comments7 min readLW link

Ideas for studies on AGI risk

dr_sApr 20, 2023, 6:17 PM

5 points

1 comment11 min readLW link

“Don’t even think about hell”

emmabMay 2, 2020, 8:06 AM

6 points

2 comments1 min readLW link

Information-Theoretic Boxing of Superintelligences

JustinShovelain and Elliot Mckernon

Nov 30, 2023, 2:31 PM

30 points

0 comments7 min readLW link

Protecting against sudden capability jumps during training

Nikola JurkovicDec 2, 2023, 4:22 AM

15 points

2 comments2 min readLW link

Counterfactual Oracles = online supervised learning with random selection of training episodes

Wei DaiSep 10, 2019, 8:29 AM

52 points

26 comments3 min readLW link

Epiphenomenal Oracles Ignore Holes in the Box

SilentCalJan 31, 2018, 8:08 PM

17 points

8 comments2 min readLW link

I played the AI Box Experiment again! (and lost both games)

TuxedageSep 27, 2013, 2:32 AM

62 points

123 comments11 min readLW link

AIs and Gatekeepers Unite!

Eliezer YudkowskyOct 9, 2008, 5:04 PM

14 points

163 comments1 min readLW link

Results of $1,000 Oracle contest!

Stuart_ArmstrongJun 17, 2020, 5:44 PM

60 points

2 comments1 min readLW link

Contest: $1,000 for good questions to ask to an Oracle AI

Stuart_ArmstrongJul 31, 2019, 6:48 PM

59 points

154 comments3 min readLW link

Oracles, sequence predictors, and self-confirming predictions

Stuart_ArmstrongMay 3, 2019, 2:09 PM

22 points

0 comments3 min readLW link

Self-confirming prophecies, and simplified Oracle designs

Stuart_ArmstrongJun 28, 2019, 9:57 AM

10 points

1 comment5 min readLW link

How to escape from your sandbox and from your hardware host

PhilGoetzJul 31, 2015, 5:26 PM

43 points

28 comments1 min readLW link

Oracle paper

Stuart_ArmstrongDec 13, 2017, 2:59 PM

12 points

7 comments1 min readLW link

[FICTION] Unboxing Elysium: An AI’S Escape

Super AGIJun 10, 2023, 4:41 AM

−16 points

4 comments14 min readLW link

Breaking Oracles: superrationality and acausal trade

Stuart_ArmstrongNov 25, 2019, 10:40 AM

25 points

15 comments1 min readLW link

Oracles: reject all deals—break superrationality, with superrationality

Stuart_ArmstrongDec 5, 2019, 1:51 PM

20 points

4 comments8 min readLW link

Ruby Sep 12, 2020, 5:11 AM
2 points
from the original talk page

Talk:AI boxing
If an SF reference is not considered a faux pas, this reminds me of John Barnes ( https://en.wikipedia.org/wiki/John_Barnes_%28author%29 ) “Meme Wars”. The way One True infected humanity is, if possible, an obvious attack vector for a sufficiently powerful AI. -- Resuna (talk) 10:20, 27 November 2014 (AEDT)
