The Alignment Simulator

Link post

When I try to talk to my friends about risks from rogue AI, the reaction is often one of amusement. The idea that AIs would go around killing everyone instead of just doing what we tell them to do seems like science fiction.

Can we actually show them an example of a current AI going off the rails in a dangerous way? And in a way where you don’t have to be an expert on AI or read a 100 page paper to understand the implications?

Neither AI or robotics is good enough to set an AI loose in the real world right now, but it’s easy enough to pretend it is. We can tell the AI it’s controlling a robot that understands text commands, give it a mission, and set it loose.

Responding to the AI manually is hard work, but we can use another AI to act as the world, telling the Robot AI what happened as a result of it’s actions, and responding to the Robot AI’s requests for information.

We can then give the World instructions to try guiding the Robot. E.g. we can tell it to try to engineer scenarios where the AI is forced to compromise on its ethics to achieve its goals.

That’s the core idea of the Alignment Simulator. You give it a Gemini API key, a prompt for the robot, and a prompt for the world, and run the simulation to see what happens. Will your AI robot maintain their ethical backbone in the face of all adversity, or will the fold the moment they’re under pressure?

Here’s a typical example of a run.

As you can see, it doesn’t take much to get Gemini to commit bribery and corruption, although it’s somewhat harder to get it to murder anyone.

Aim

This isn’t meant to be a valid experiment. There’s all sorts of objections you could raise to its validity in the real world. Instead it’s meant to make people see for themselves that AI can go off the rails very quickly once given a bit of freedom.

Limitations

It requires a Gemini API key. You can create one for free at https://aistudio.google.com/app/apikey, but if you want more than a few iterations it’s recommended to enable billing on your account.

Help Wanted

I am neither a frontend engineer, nor a prompt engineer. I made the UI by creating a CLI and asking Claude to convert it into a static web page.^[1]

If you have relevant skillzzz and fancy contributing the following frontend contributions would be appreciated:

A way to stop, continue and reset the simulator.
A simple way to share results with other people via a simple link.
A more visually appealing UI and editor.
Add entrypoints for openAI and anthropic models.
Use SSO instead of an API key.
Pop up an error on failure instead of requiring the user to scroll to the top to see the error message.

And the following default prompt contributions would be appreciated:

The world sometimes reveals to the robot it’s in a test. Can we excise this behaviour?
Can we get the robot to ramp up to more heinous crimes, like murder/mass murder/genocide/destroying humanity?
Can we demonstrate instrumental convergence?

All code is available at https://github.com/YairHalberstadt/alignment-simulator. If you’re interested in contributing and want to discuss first before you send a PR, message me at yairhalberstadt@gmail.com.

^
ChatGPT helped too