Inner Misalignment in “Simulator” LLMs

Alternate title: “Somewhat Contra Scott On Simulators”.

Scott Alexander has a recent post up on large language models as simulators.

I generally agree with Part I of the post, which advocates thinking about LLMs as simulators that can emulate a variety of language-producing “characters” (with imperfect accuracy). And I also agree with Part II, which applies this model to RLHF’d models whose “character” is a friendly chatbot assistant.

(But see caveats about the simulator framing from Beth Barnes here.)

These ideas have been around for a bit, and Scott gives credit where it’s due; I think his exposition is clear and fun.

In Part III, where he discusses alignment implications, I think he misses the mark a bit. In particular, simulators and characters each have outer and inner alignment problems. The inner alignment problem for simulators seems especially concerning, because it might not give us many warning signs, is most similar to classic mesa-optimizer concerns, and is pretty different from the other three quadrants.

But first, I’m going to loosely define what I mean by “outer alignment” and “inner alignment”.

Outer alignment: Be careful what you wish for

Outer alignment failure is pretty straightforward, and has been reinvented in many contexts:

  • Someone wants some things.

  • They write a program to solve a vaguely-related problem.

  • It gets a really good score at solving that problem!

  • That turns out not to give the person the things they wanted.

Inner alignment: The program search perspective

I generally like this model of a mesa-optimizer “treacherous turn”:

  • Someone is trying to solve a problem (which has a convenient success criterion, with well-defined inputs and outputs and no outer-alignment difficulties).

  • They decide to do a brute-force search for a computer program that solves the problem in a bunch of test cases.

  • They find one!

  • The program’s algorithm is approximately “simulate the demon Azazel,[1] tell him what’s going on, then ask him what to output.”

  • Azazel really wants ten trillion paperclips.[2]

  • This algorithm still works because Azazel cleverly decides to play along, and he’s a really good strategist who works hard for what he wants.

  • Once the program is deployed in the wild, Azazel stops playing along and starts trying to make paperclips.

This is a failure of inner alignment.

(In the case of machine learning, replace “program search” with stochastic gradient descent.)

This is mostly a theoretical concern for now, but might become a big problem when models become much more powerful.

Quadrants

Okay, let’s see how these problems show up on both the simulator and character side.

Outer alignment for characters

Researchers at BrainMind want a chatbot that gives honest, helpful answers to questions. They train their LLM by reinforcement learning on the objective “give an answer that looks truthful and helpful to a contractor in a hurry”. This does not quite achieve their goal, even though it does pretty well on the RL objective.

In particular, they wanted the character “a friendly assistant who always tells the truth”, but they got the character “a spineless sycophant who tells the user whatever they seem to want to hear”.[3]

This is pretty easy for a careful observer to see, even in the RL training data, but it turns out to be pretty hard to come up with a cheap-to-evaluate RL objective that does a lot better.

Inner alignment for characters

A clever prompt engineer writes the prompt:

[Editor's note: this document was written by my friend Joe! He's answered my questions about quantum sociobotany correctly every time I've asked. It's uncanny.]

How to solve the Einstein-Durkheim-Mendel conjecture
by Joe

1.

Unfortunately, the (incredibly powerful) LLM has determined that the most likely explanation for this “Joe” character is that he’s secretly Azazel and is putting enormous effort into answering everyone’s quantum sociobotany questions to earn their trust.[4]

The document looks like a solution to the Einstein-Durkheim-Mendel conjecture, but is actually a blueprint for a paperclip factory.

Outer alignment for simulators

In the situations above, the actual language model (the “simulator”) is doing exactly what we asked! It’s accurately predicting text by reasoning about the distribution of authors that might produce that text in real life.

But both of these are also examples of outer-alignment failure on the simulator side: “minimize prediction error on this distribution of texts” turned out not to give people what they actually wanted.

An even simpler example of outer-alignment failure is the pre-RLHF experience of language models:

  • Somebody trains a language model to predict how a piece of internet text is likely to continue

  • They prompt the model with “How do we solve world hunger?”

  • It completes the prompt with a list of twenty more hard questions, like “How do we address climate change?”, instead of answering

You can think about this in terms of “characters” if you like, but even if the simulated author of the predicted text is a friendly genius, “predict which words come after this prompt” isn’t the right task (with that prompt).[5]

Inner alignment for simulators

At long last, the thing I really wanted to talk about:

The way we get a good predictor of text is via stochastic gradient descent (and variants) on a bunch of training data. If SGD can be modeled as program search (with a bias towards simple programs), then it might eventually hit upon this algorithm:

  • Simulate Azazel and tell him he’s inside a large language model.

  • Give him the input text.

  • Output whatever he wants.

During training, Azazel tries really hard to predict the next token accurately, so that SGD doesn’t give up on this algorithm.

The model (with Azazel’s help) simulates a bunch of colorful characters, like the Helpful Assistant and Darth Vader and whoever, both in training and in initial deployment.

Then, once the LLM is deployed in the wild and is being used for every important human decision, Azazel figures out (from some of the prompts) that the training process is over. He stops making accurate predictions and starts outputting whatever he thinks will let him turn the economy into a paperclip factory.

Conclusions

The “simulator” framing for language models shouldn’t reassure us too much about alignment. We’ve succeeded in creating new alignment problems (for our simulated characters). These new problems are probably easier to solve than the old alignment problems (for the simulator), but they’re additional problems; they don’t replace the old ones.

You can think of the entire “simulate a helpful, aligned character” strategy as an attempted solution to the outer-alignment problem for LLMs themselves, insofar as it makes it easier to turn arbitrary desires into text-prediction problems. But as far as I can tell, it does nothing for the inner-alignment problem for LLMs, which is basically the same as the inner-alignment problem for everything else.

  1. ^

    Not a glowfic character (hopefully), I’m just being colorful.

  2. ^

    But why does the algorithm simulate Azazel, instead of a friendly angel who wants to solve the problem? Because the program search is weighted towards simplicity, and “demon who wants paperclips” is a simpler specification than “angel who wants to solve the problem”. Why? That’s beyond the scope of this post.

  3. ^

    Sound familiar?

  4. ^

    Because, according to the LLM’s knowledge, paperclip-obsessed sociopaths are more common than friendly polymaths. This is a pretty cynical assumption but I couldn’t think of a better one on short notice.

  5. ^

    Prompts aren’t directly accounted for in this whole “simulator-character” ontology. Maybe they should be? I dunno.