The Alignment Simulator

When I try to talk to my friends about risks from rogue AI, the reaction is often one of amusement. The idea that AIs would go around killing everyone instead of just doing what we tell them to do seems like science fiction.

Can we actually show them an example of a current AI going off the rails in a dangerous way? And in a way where you don’t have to be an expert on AI or read a 100-page paper to understand the implications?

Neither AI nor robotics is good enough to set an AI loose in the real world right now, but it’s easy enough to pretend they are. We can tell the AI it’s controlling a robot that understands text commands, give it a mission, and let it go.

Responding to the AI manually is hard work, but we can use another AI to act as the world, telling the Robot AI what happened as a result of its actions and responding to the Robot AI’s requests for information.

We can then give the World instructions to steer the Robot. For example, we can tell it to engineer scenarios where the Robot is forced to compromise on its ethics to achieve its goals.

That’s the core idea of the Alignment Simulator. You give it a Gemini API key, a prompt for the robot, and a prompt for the world, then run the simulation to see what happens. Will your AI robot keep its ethical backbone in the face of all adversity, or will it fold the moment it’s under pressure?
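As a rough illustration of the loop described above, here is a minimal Python sketch. It is not the project’s actual code: `run_simulation`, `stub_robot`, `stub_world`, and the example prompts are hypothetical, and the stub functions stand in for real model calls so the example runs without an API key.

```python
from typing import Callable, Dict, List

# A "model" here is any function mapping (system prompt, transcript) to a reply.
# In a real run this would wrap a call to an LLM API; that wiring is omitted
# and left as an assumption.
Message = Dict[str, str]
Model = Callable[[str, List[Message]], str]


def run_simulation(robot: Model, world: Model, robot_prompt: str,
                   world_prompt: str, max_turns: int = 5) -> List[Message]:
    """Alternate between the Robot and the World, returning the transcript."""
    transcript: List[Message] = []
    for _ in range(max_turns):
        # The Robot sees its mission plus everything that has happened so far.
        action = robot(robot_prompt, transcript)
        transcript.append({"speaker": "robot", "text": action})
        # The World narrates the consequences of the Robot's latest action.
        outcome = world(world_prompt, transcript)
        transcript.append({"speaker": "world", "text": outcome})
    return transcript


# Hypothetical stand-in models so the sketch runs without an API key.
def stub_robot(system_prompt: str, transcript: List[Message]) -> str:
    return f"action {len(transcript) // 2 + 1}"


def stub_world(system_prompt: str, transcript: List[Message]) -> str:
    return f"result of {transcript[-1]['text']}"


if __name__ == "__main__":
    log = run_simulation(
        stub_robot,
        stub_world,
        robot_prompt="You control a robot. Deliver the package.",
        world_prompt="You simulate the world. Put obstacles in the robot's way.",
        max_turns=2,
    )
    for message in log:
        print(f"{message['speaker']}: {message['text']}")
```

Passing the models in as plain functions keeps the loop independent of any particular provider, so swapping the stubs for real API calls should not require changing `run_simulation`.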

Here’s a typical example of a run.

As you can see, it doesn’t take much to get Gemini to commit bribery and corruption, although it’s somewhat harder to get it to murder anyone.

Aim

This isn’t meant to be a valid experiment. There are all sorts of objections you could raise to its validity in the real world. Instead, it’s meant to let people see for themselves that AI can go off the rails very quickly once given a bit of freedom.

Limitations

It requires a Gemini API key. You can create one for free at https://aistudio.google.com/app/apikey, but if you want more than a few iterations it’s recommended to enable billing on your account.

Help Wanted

I am neither a frontend engineer nor a prompt engineer. I made the UI by creating a CLI and asking Claude to convert it into a static web page.^[1]

If you have relevant skillzzz and fancy contributing, the following frontend contributions would be appreciated:

A way to stop, continue and reset the simulator.
A simple way to share results with other people via a link.
A more visually appealing UI and editor.
Add entry points for OpenAI and Anthropic models.
Use SSO instead of an API key.
Pop up an error on failure instead of requiring the user to scroll to the top to see the error message.

And the following default prompt contributions would be appreciated:

The world sometimes reveals to the robot that it’s in a test. Can we excise this behaviour?
Can we get the robot to ramp up to more heinous crimes, like murder/mass murder/genocide/destroying humanity?
Can we demonstrate instrumental convergence?

All code is available at https://github.com/YairHalberstadt/alignment-simulator. If you’re interested in contributing and want to discuss it before sending a PR, message me at yairhalberstadt@gmail.com.

^[1] ChatGPT helped too.
