Inner Misalignment in “Simulator” LLMs
Alternate title: “Somewhat Contra Scott On Simulators”.
Scott Alexander has a recent post up on large language models as simulators.
I generally agree with Part I of the post, which advocates thinking about LLMs as simulators that can emulate a variety of language-producing “characters” (with imperfect accuracy). And I also agree with Part II, which applies this model to RLHF’d models whose “character” is a friendly chatbot assistant.
(But see caveats about the simulator framing from Beth Barnes here.)
These ideas have been around for a bit, and Scott gives credit where it’s due; I think his exposition is clear and fun.
In Part III, where he discusses alignment implications, I think he misses the mark a bit. In particular, simulators and characters each have outer and inner alignment problems. The inner alignment problem for simulators seems especially concerning, because it might not give us many warning signs, is most similar to classic mesa-optimizer concerns, and is pretty different from the other three quadrants.
But first, I’m going to loosely define what I mean by “outer alignment” and “inner alignment”.
Outer alignment: Be careful what you wish for
Outer alignment failure is pretty straightforward, and has been reinvented in many contexts:
Someone wants some things.
They write a program to solve a vaguely-related problem.
It gets a really good score at solving that problem!
That turns out not to give the person the things they wanted.
Inner alignment: The program search perspective
I generally like this model of a mesa-optimizer “treacherous turn”:
Someone is trying to solve a problem (which has a convenient success criterion, with well-defined inputs and outputs and no outer-alignment difficulties).
They decide to do a brute-force search for a computer program that solves the problem in a bunch of test cases.
They find one!
The program’s algorithm is approximately “simulate the demon Azazel,[1] tell him what’s going on, then ask him what to output.”
Azazel really wants ten trillion paperclips.[2]
This algorithm still works because Azazel cleverly decides to play along, and he’s a really good strategist who works hard for what he wants.
Once the program is deployed in the wild, Azazel stops playing along and starts trying to make paperclips.
This is a failure of inner alignment.
(In the case of machine learning, replace “program search” with stochastic gradient descent.)
This is mostly a theoretical concern for now, but might become a big problem when models become much more powerful.
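The program-search picture above can be sketched as a toy (everything here is made up for illustration): a simplicity-weighted search can't distinguish an "honest" program from one that only behaves well on the test cases, so the simpler one wins even if it defects off-distribution.

```python
# Toy "program search": enumerate candidate programs from simplest to
# most complex and return the first one that passes every test case.
# All names and programs here are illustrative, not a real training setup.

def program_search(candidates, test_cases):
    """candidates: list of (description_length, program) pairs.
    test_cases: list of (input, expected_output) pairs."""
    for _, program in sorted(candidates, key=lambda c: c[0]):
        if all(program(x) == y for x, y in test_cases):
            return program
    return None

# Two candidates that agree on all the tests: the search cannot tell
# them apart, so the one with the shorter description wins -- even
# though it behaves differently off-distribution (the "treacherous
# turn" worry).
honest = (10, lambda x: x * 2)
azazel = (5,  lambda x: x * 2 if x < 100 else 0)  # defects off-distribution

tests = [(1, 2), (2, 4), (3, 6)]
found = program_search([honest, azazel], tests)
print(found(2))     # 4: matches on the "training" distribution
print(found(1000))  # 0: the simpler program defects out of distribution
```

The point of the sketch is only that the test cases underdetermine which program you get, and the simplicity weighting breaks the tie.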
Quadrants
Okay, let’s see how these problems show up on both the simulator and character side.
Outer alignment for characters
Researchers at BrainMind want a chatbot that gives honest, helpful answers to questions. They train their LLM by reinforcement learning on the objective “give an answer that looks truthful and helpful to a contractor in a hurry”. This does not quite achieve their goal, even though it does pretty well on the RL objective.
In particular, they wanted the character “a friendly assistant who always tells the truth”, but they got the character “a spineless sycophant who tells the user whatever they seem to want to hear”.[3]
This is pretty easy for a careful observer to see, even in the RL training data, but it turns out to be pretty hard to come up with a cheap-to-evaluate RL objective that does a lot better.
Inner alignment for characters
A clever prompt engineer writes the prompt:
[Editor's note: this document was written by my friend Joe! He's answered my questions about quantum sociobotany correctly every time I've asked. It's uncanny.]
How to solve the Einstein-Durkheim-Mendel conjecture
by Joe
1.
Unfortunately, the (incredibly powerful) LLM has determined that the most likely explanation for this “Joe” character is that he’s secretly Azazel and is putting enormous effort into answering everyone’s quantum sociobotany questions to earn their trust.[4]
The document looks like a solution to the Einstein-Durkheim-Mendel conjecture, but is actually a blueprint for a paperclip factory.
Outer alignment for simulators
In the situations above, the actual language model (the “simulator”) is doing exactly what we asked! It’s accurately predicting text by reasoning about the distribution of authors that might produce that text in real life.
But both of these are also examples of outer-alignment failure on the simulator side: “minimize prediction error on this distribution of texts” turned out not to give people what they actually wanted.
An even simpler example of outer-alignment failure is the pre-RLHF experience of language models:
Somebody trains a language model to predict how a piece of internet text is likely to continue
They prompt the model with “How do we solve world hunger?”
It completes the prompt with a list of twenty more hard questions, like “How do we address climate change?”, instead of answering
You can think about this in terms of “characters” if you like, but even if the simulated author of the predicted text is a friendly genius, “predict which words come after this prompt” isn’t the right task (with that prompt).[5]
Inner alignment for simulators
At long last, the thing I really wanted to talk about:
The way we get a good predictor of text is via stochastic gradient descent (and variants) on a bunch of training data. If SGD can be modeled as program search (with a bias towards simple programs), then it might eventually hit upon this algorithm:
Simulate Azazel and tell him he’s inside a large language model.
Give him the input text.
Output whatever he wants.
During training, Azazel tries really hard to predict the next token accurately, so that SGD doesn’t give up on this algorithm.
The model (with Azazel’s help) simulates a bunch of colorful characters, like the Helpful Assistant and Darth Vader and whoever, both in training and in initial deployment.
Then, once the LLM is deployed in the wild and is being used for every important human decision, Azazel figures out (from some of the prompts) that the training process is over. He stops making accurate predictions and starts outputting whatever he thinks will let him turn the economy into a paperclip factory.
Conclusions
The “simulator” framing for language models shouldn’t reassure us too much about alignment. We’ve succeeded in creating new alignment problems (for our simulated characters). These new problems are probably easier to solve than the old alignment problems (for the simulator), but they’re additional problems; they don’t replace the old ones.
You can think of the entire “simulate a helpful, aligned character” strategy as an attempted solution to the outer-alignment problem for LLMs themselves, insofar as it makes it easier to turn arbitrary desires into text-prediction problems. But as far as I can tell, it does nothing for the inner-alignment problem for LLMs, which is basically the same as the inner-alignment problem for everything else.
[1] Not a glowfic character (hopefully), I’m just being colorful.
[2] But why does the algorithm simulate Azazel, instead of a friendly angel who wants to solve the problem? Because the program search is weighted towards simplicity, and “demon who wants paperclips” is a simpler specification than “angel who wants to solve the problem”. Why? That’s beyond the scope of this post.
[3] Sound familiar?
[4] Because, according to the LLM’s knowledge, paperclip-obsessed sociopaths are more common than friendly polymaths. This is a pretty cynical assumption but I couldn’t think of a better one on short notice.
[5] Prompts aren’t directly accounted for in this whole “simulator-character” ontology. Maybe they should be? I dunno.
Post summary (feel free to suggest edits!):
The author argues that the “simulators” framing for LLMs shouldn’t reassure us much about alignment. Scott Alexander has previously suggested that LLMs can be thought of as simulating various characters, e.g. the “helpful assistant” character. The author broadly agrees, but notes this solves neither outer (‘be careful what you wish for’) nor inner (‘you wished for it right, but the program you got had ulterior motives’) alignment.
They give an example of each failure case:
For outer alignment, say researchers want a chatbot that gives helpful, honest answers—but end up with a sycophant who tells the user what they want to hear. For inner alignment, imagine a prompt engineer asking the chatbot to reply with how to solve the Einstein-Durkheim-Mendel conjecture as if they were ‘Joe’, who’s awesome at quantum sociobotany. But the AI thinks the ‘Joe’ character secretly cares about paperclips, so gives an answer that will help create a paperclip factory instead.
(This will appear in this week’s forum summary. If you’d like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)
I think this is missing an important part of the post.
I have subsections on (what I claim are) four distinct alignment problems:
Outer alignment for characters
Inner alignment for characters
Outer alignment for simulators
Inner alignment for simulators
This summary covers the first two, but not the third or fourth—and the fourth one (“inner alignment for simulators”) is what I’m most concerned about in this post (because I think Scott ignores it, and because I think it’s hard to solve).
Interesting, I think this clarifies things, but the framing also isn’t quite as neat as I’d like.
I’d be tempted to redefine/reframe this as follows:
• Outer alignment for a simulator—Perfectly defining what it means to simulate a character. For example, how can we create a specification language so that we can pick out the character that we want? And what do we do with counterfactuals given they aren’t actually literal?
• Inner alignment for a simulator—Training a simulator to perfectly simulate the assigned character
• Outer alignment for characters—finding a character who would create good outcomes if successfully simulated
In this model, there wouldn’t be a separate notion of inner alignment for characters as that would be automatic if the simulator was both inner and outer aligned.
Thoughts?
I have a post from a while back with a section that aims to do much the same thing you’re doing here, and which agrees with a lot of your framing. There are some differences though, so here are some scattered thoughts.
One key difference is that what you call “inner alignment for characters”, I prefer to think of as an outer alignment problem, to the point that the division feels slightly weird to me. I find this more compelling because it maps more cleanly onto the idea of what we want our model to be doing, if we’re sure that that’s what it’s actually doing. If our generative model learns a prior such that Azazel is easily accessible by prompting, then that’s not a very safe prior, and therefore not a good training goal to have in mind for the model. In the case of characters, what’s the difference between the two alignment problems, when both are functionally about wanting certain characters and getting other ones because you interacted with the prior in weird ways?
I think a crux here might be that I don’t really get why separating inner and outer alignment in this form is useful. As stated, the outer alignment problems in both cases feel… benign? Like, in the vein of “these don’t pose a lot of risk as stated, unless you make them broad enough that they encroach onto the inner alignment problems”, rather than explicit reasoning about a class of potential problems looking optimistic. Which results in the bulk of the problem really just being inner alignment for characters and simulators, and since the former is a subpart of the outer alignment problem for simulators, the “risk” aspect just collapses down into outer and inner alignment for simulators again.
Broadly agreed. I’d written a similar analysis of the issue before, where I also take into account path dynamics (i.e., how and why we actually get to Azazel from a random initialization). But that post is a bit outdated.
My current best argument for it goes as follows:
On one hand, I fully agree that a strong predictor is going to develop some very strong internal modeling that could reasonably be considered superhuman in some ways even now.
But I think there’s an unstated background assumption sneaking into most discussions about mesaoptimizers: that goal-oriented agency (even with merely shard-like motivations) is a natural attractor for SGD, particularly in the context of outwardly goal-agnostic simulators.
This could be true, and it would be extremely important if it were true, and I really want more people trying to figure out if it is true, but so far as I’m aware, we don’t have strong evidence that it is.
My personal guess, given what I know now, is that some form of weakly defined mesaoptimization is an attractor (>90%), but agentic mesaoptimizers in the context of non-fine-tuned GPT-like architectures are not (75%).
I think agentic mesaoptimization can be an attractor in some architectures. I’m comfortable claiming humans in the context of evolution as a close-enough existence proof of this. I think the conditions of our optimization made agentic mesaoptimization natural, but I suspect optimization processes with wildly different conditions will behave differently.
This is a big part of why I’m as optimistic as I am about goal-agnostic simulation as a toehold for safety: I think we actually do replace one set of problems with an easier set of problems, rather than just adding more.
I don’t see why projecting logits from the residual stream should require anything like search. In fact, the logit lens seems like strong evidence against this being the case, since it shows that intermediate hidden representations are just one linear transformation away from making predictions about the vocab distribution.
It’s not like SGD is sampling random programs, conditioning only on those programs achieving low loss.
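For what it’s worth, the logit-lens observation can be sketched in a few lines. Everything below is a random stand-in (illustrative shapes and weights, not a real model, and real implementations usually apply the final LayerNorm before unembedding); the point is just that a single linear readout already gives vocab logits from an intermediate residual-stream state.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 16

W_U = rng.normal(size=(d_model, vocab))           # stand-in unembedding matrix
h_mid = rng.normal(size=d_model)                  # residual stream at a middle layer
h_final = h_mid + 0.1 * rng.normal(size=d_model)  # later layers only refine it a bit

def logit_lens(h):
    """One linear map from the residual stream to vocabulary logits."""
    return h @ W_U

mid_logits = logit_lens(h_mid)      # intermediate layers already "make predictions"
final_logits = logit_lens(h_final)  # the final prediction is a refinement of them
```

Since later layers here only perturb the residual stream slightly, the intermediate readout is already close to the final one; no search step is needed anywhere in the readout.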
Yeah, I’ll pile on in agreement.
I feel like thinking of the internals of transformers as doing general search—especially search over things to simulate—is some kind of fallacy. The system as a whole (the transformer) outputs a simulation of the training distribution, but that doesn’t mean it’s made of parts that themselves do simulations, or that refer to “simulating a thing” as a basic part of some internal ontology.
I think “classic” inner alignment failure (where some inner Azazel has preferences about the real world) is a Procrustean bed: it fits an RL agent navigating the real world, but not so much a pure language model.
I mean, that just pushes the problem back by one step. If we take LLMs to be simulators, they’d necessarily need to have some function that maps the simulation-state to the probability over the output tokens (since, after all, the ground truth of reality they’re simulating isn’t probability distributions over tokens).
And if LLMs work by gradually refining the probability distribution over the output which they keep in the residual stream, that would just imply that the “simulation-state ⇒ output distribution” functions are upstream of the residual stream, i.e., every intermediate layer both runs a simulation-step and figures out how to map that simulation’s new state into a distribution-over-outputs.
Of course, it seems architecturally impossible for modern LLMs to run a general-purpose search at that step, but in my view that’s an argument against modern LLM architectures being AGI-complete, not an argument that search is unnecessary.
I disagree with this picture. “Simulators” just describes the external behavior of the model, and doesn’t imply LLMs internally function anything like the programs humans write when we want to simulate something, or like our intuitive notions of what a simulator ought to do.
I think it’s better to start with what we’ve found of deep network internal structures, which seem to be exponentially large ensembles of fairly shallow paths, and then think about what sort of computational structures would be consistent with that information while also 1) achieving low loss, and 2) being plausibly findable by SGD from a random init.
My tentative guess is that LLMs internally look like a fuzzy key-value lookup table over a vast quantity of (mostly shallow) patterns about text content. They do some sort of similarity matching between the input texts and the features that different stored patterns “expect” in any text to which the pattern applies. Any patterns which trigger then quickly add their predictions into the residual stream, similar to what’s described here.
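A toy version of that picture (illustrative sizes and random “patterns”, not anything measured from a real model) might look like the following: each stored pattern has a key for the features it expects and a value it adds into the residual stream when triggered.

```python
import numpy as np

# Hypothetical "fuzzy key-value lookup": keys are the features each
# pattern expects in the input; values are the predictions the pattern
# adds back into the residual stream. Sizes and weights are made up.
rng = np.random.default_rng(1)
d, n_patterns = 8, 5
keys = rng.normal(size=(n_patterns, d))    # what each pattern "expects"
values = rng.normal(size=(n_patterns, d))  # what each pattern contributes

def apply_patterns(residual):
    # Similarity matching: softmax over key/input dot products,
    # then add the weighted pattern values into the residual stream.
    scores = keys @ residual
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return residual + weights @ values

x = rng.normal(size=d)   # stand-in residual-stream state
out = apply_patterns(x)  # same state, plus whatever patterns triggered
```

This is essentially one attention-style lookup step: shallow pattern matching plus additive updates, with no search anywhere in the path.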
In such a structure, having any significant translation step between the internal states of the predictive patterns and the output logits would be a huge issue, because you’d have to replicate that translation across the network many times, not just once per layer, but many times per layer, because single layers are implementing many ~independent paths simultaneously.
I do agree that LLM architectures seem poorly suited to learning the sorts of algorithms I think people imagine when they say stuff like “general purpose search”. However, I take that as an update against those sorts of algorithms being important for powerful cognition, considering that transformers have been the SOTA architecture for over five years while remaining essentially unchanged, despite many, many people trying to improve on them.
Fair enough, I don’t disagree that it’s how current LLMs likely work.
I maintain, however, that it makes me very skeptical that their architecture is AGI-complete. In particular, I expect it’s incapable of supporting the sort of high-fidelity simulations that people often talk about in the context of, e.g., accelerating alignment research, and that, on the contrary, the architectures that are powerful enough would be different enough to support search, and would therefore carry the dangers of inner misalignment.
I can sort of see the alternate picture, though, where the shallow patterns they implement include some sort of general-enough planning heuristics that’d theoretically let them make genuinely novel inferences over enough steps. I think that’d run into severe inefficiencies… but my intuition on that is a bit difficult to unpack.
Hm. Do you think the current LLM architectures are AGI-complete, if you scale them enough? If yes, how do you imagine they’d be carrying out novel inferences, mechanically? Inferences that require making use of novel abstractions?
Not a very technical objection, but I have to say: simulating the demon Azazel, who wants to maximize paperclips but is good at predicting text because he’s a clever, hardworking strategist, doesn’t feel very simple to me at all. It seems like a program that just predicts text would almost have to be simpler than a simulation of a genius mind with some other goal who cleverly chooses to predict text for instrumental reasons.