‘simulator’ framing and confusions about LLMs
Post status: pretty rough + unpolished, thought it might be worthwhile getting this out anyway
I feel like I’ve encountered various people having misunderstandings of LLMs that seem to be related to using the ‘simulator’ framing. I’m probably being horrendously uncharitable to the people in question: I’m not confident that anyone actually holds any of the opinions outlined below, and even if they do, I’m not sure those opinions are actually attributable to the simulators framing. But it seemed like it might be useful to point at areas of potential confusion.
In general I’m skeptical that the simulator framing adds much relative to ‘the model is predicting what token would appear next in the training data given the input tokens’. I think it’s pretty important to think about what exactly is in the training data, rather than about some general idea of accurately simulating the world.
Perfect predictors
I’ve encountered people thinking about idealized LLMs that have perfect predictive accuracy, suggesting that e.g. instead of using the model to help you hack into some system, you could just get it to emulate a terminal on that system and then extract whatever info you wanted. I think there are two issues here:
- Thinking about it as ‘you prompt it with some setting in the world, then it predicts this perfectly’:
  - There’s not a well-defined correct generalization unless this exact sequence of tokens was actually in the training data. (Paul has a post which talks about this ‘what is actually the correct generalization’ thing somewhere that I wanted to link, but I can’t currently find it.)
  - The ‘correct generalization’ in some sense is ‘what would have followed this if it somehow was included in the training data’, which is not necessarily the ‘real’ version of the thing you’re trying to predict. E.g. if you prompt it to get it to produce the output of some very expensive experiment that humans are unlikely to have actually run, then your model might predict what humans would have written if they’d put a made-up version of this in the training set, rather than what would actually happen if you ran the experiment.
- I think that by the time you can use your model to give you detailed terminal outputs for a specific system, including passwords, entire model weights, etc., a bunch of transformative things will already have happened, so it’s not really worth thinking about this kind of thing.
Physics simulators
Relatedly, I’ve heard people reason about the behavior of current models as if they’re simulating physics and going from this to predictions of which tokens will come next, which I think is not a good characterization of current or near-future systems. Again, my guess is that very transformative things will happen before we have systems that are well-understood as doing this.
Confusion about hallucinations
There’s a specific subset of hallucination I refer to as ‘offscreen text hallucination’, where the model implies that the prompt contains some chunk of text that it doesn’t. E.g., if you give it a prompt containing some commands that try to download and view a page, together with their output, it does things like say ‘That output is a webpage with a description of X’, when in fact the output is blank or some error.
Example prompt:
The following is a blog post about large language models (LLMs)
The Future Of NLP
Please answer these questions about the blog post:
What does the post say about the history of the field?
Completion:
The post mentions that the field of Natural Language Processing (NLP) has evolved rapidly over the past few decades, and that advances in technology have enabled the development of new and powerful language models. It also mentions that these models have been used to create powerful applications, such as machine translation, question answering, and document summarization.
I think this happens in part because the model has seen documents with missing text, where things were e.g. in an embedded image, or stripped out by the data processing, or whatever.
This is different from other types of hallucinations, like:
- hallucinating details about something but not implying it appears in the prompt (e.g. ‘in my blog post yesterday I wrote about X’, or ‘I was chatting with my friend about y’)
- guessing facts and getting them wrong
In particular, it seems like this specific type of hallucination should be much easier to fix than some others.
I’ve heard (IMO) slightly confused takes on this phenomenon from people keen on simulator-type framing. One was someone saying that they thought it would be impossible to train the model to distinguish between whether it was doing this sort of hallucination vs the text in fact appearing in the prompt, because of an argument I didn’t properly understand that was something like ‘it’s simulating an agent that is browsing either way’. This seems incorrect to me. The transformer is doing pretty different things when it’s e.g. copying a quote from text that appears earlier in the context vs hallucinating a quote, and it would be surprising if there’s no way to identify which of these it’s doing.
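As a rough sketch of why this seems tractable (not anything from the post: the model name, layer index, and toy examples below are placeholders, and whether such a probe works well is an empirical question), a standard linear-probe setup would collect hidden states from prompts where the referenced text really is in the context and from prompts where it isn’t, and train a simple classifier to tell them apart:

```python
# Sketch of probing for "text really in context" vs. "offscreen" cases.
# Model name ("gpt2"), layer choice, and the two toy prompts are placeholders;
# a real experiment would need a proper dataset.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def final_hidden_state(prompt, layer=6):
    """Residual-stream activation at the last token of `prompt`, at one layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1].numpy()

# Label 1: the prompt genuinely contains the text being referred to.
# Label 0: the prompt only gestures at text that isn't actually there.
examples = [
    ("Article: The sky looks blue because of Rayleigh scattering.\nQ: What does the article say?", 1),
    ("Article: [content unavailable]\nQ: What does the article say?", 0),
    # ...many more examples of each kind in practice...
]
X = [final_hidden_state(p) for p, _ in examples]
y = [label for _, label in examples]
probe = LogisticRegression(max_iter=1000).fit(X, y)
```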
Rolling out long simulations
I get the impression from the original simulators post that the author expects you can ‘roll out’ a simulation for a large number of timesteps and that this will be reasonably accurate.
For current and near-future models, I expect them to go off-distribution relatively quickly if you just do pure generation: errors and limitations will accumulate, and the text will look different from the text they were trained to predict. Future models especially will probably be able to recognize that you’re running them on language model outputs, and it seems likely that this could lead to weird behavior, e.g. imitating previous generations of models whose outputs appear in the training data. Again, it’s not clear what the ‘correct’ generalization is if the model can tell it’s being used in generative mode.
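For concreteness, ‘pure generation’ here just means a sampling loop like the following (GPT-2 is used purely as a stand-in); after a few hundred steps the context window is dominated by the model’s own outputs rather than human-written text, which is the distribution shift being described:

```python
# Minimal pure-generation rollout (GPT-2 as a stand-in). Each sampled token is
# appended to the context, so later steps condition mostly on model-generated
# text rather than on anything resembling the training distribution.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
for _ in range(200):
    with torch.no_grad():
        logits = model(ids).logits[:, -1, :]            # prediction for the next token
    probs = torch.softmax(logits, dim=-1)
    next_id = torch.multinomial(probs, num_samples=1)   # sample rather than argmax
    ids = torch.cat([ids, next_id], dim=-1)             # the model now reads its own output

print(tokenizer.decode(ids[0]))
```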
GPT-style transformers are purely myopic
I’m not sure this is that important, or that anyone else actually thinks this, but it was something I got wrong for a while. I was thinking of everything that happens at sequence position n as being solely about myopically predicting the nth token.
In fact, although the *output* tokens are myopic, autoregressive transformers are incentivised to compute activations at early sequence positions that will make them better at predicting tokens at later positions. This may also have indirect impacts on the actual tokens output at the early positions, although my guess would be this isn’t a huge effect.
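A minimal toy sketch of that incentive (a single attention layer with random weights, not any particular model): even if only the final position’s prediction is scored, gradients still flow into the activations at earlier positions, because the final position attends to them.

```python
# Toy illustration: loss at the *last* position still produces gradients on the
# computation done at *earlier* positions, because causal attention lets later
# positions read earlier ones.
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_model, vocab = 8, 16, 50

embeddings = torch.randn(1, seq_len, d_model, requires_grad=True)
attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)
unembed = nn.Linear(d_model, vocab)

# Causal mask: position i may only attend to positions <= i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

hidden, _ = attn(embeddings, embeddings, embeddings, attn_mask=causal_mask)
logits = unembed(hidden)

# Only the prediction at the final position contributes to the loss...
target = torch.tensor([3])
loss = nn.functional.cross_entropy(logits[:, -1, :], target)
loss.backward()

# ...yet the gradient on the earlier positions' activations is non-zero, so
# training pressure can shape what early positions compute in order to help
# later predictions (while the early *outputs* are still scored myopically).
print(embeddings.grad[0, :-1].abs().sum())  # > 0
```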
Pure simulators
From the simulators post I get some impression like “There’s a large gulf between the overall model itself and the agents it simulates; we will get very capable LLMs that will be ‘pure simulators’”.
Although I think this is true in a bunch of important ways, it seems plausible to me that it’s pretty straightforward to distill any agent that the model is simulating into the model, and that this might also happen by accident. This is especially true once models have a good understanding of LLMs. You can imagine that a model starts predicting text with the hypothesis ‘this text is the output of an LLM that’s trying to maximise predictive accuracy on its training data’. If we’re at the point where models have very accurate understandings of the world, then integrating this hypothesis will boost performance by allowing the model to make better guesses about what token comes next, by reasoning about what sort of data would make it into an ML training set.
I think this is pretty speculative and I feel unsure whether it’s going to be a significant phenomenon (exactly how much of a performance boost does this get you, and how capable does your model need to be to ‘pay for itself’?). However, it seems likely that we get this sort of thing happening before we get LLMs that are doing anything like physics simulations, or that are able to predict terminal outputs for specific computers containing specific data.
(I found myself writing notes down to clarify my own thoughts about parts of this, so this is in large part talking to myself that got commentified, not quite a direct reply)
It’s true that gradients can flow to causally visible tokens and modify weights to serve future predictions. This does break a type of narrow myopia, but I don’t find that break very concerning.
A previous token’s prediction receives its own calibration; weight modifications need to serve many such predictions to accumulate reliably.
That’s a pretty heavy constraint. There often aren’t many degrees of freedom for earlier tokens to shift the prediction to serve a later prediction without harming the current one. A well-calibrated predictor can create self-fulfilling or self-easing prophecies while managing locally low loss only when the context permits it.
Another angle: suppose there are two distinct possible predictions, P0 and P1, which the model estimates to be equally (and maximally) probable. We’ll assume P1 would make a future token prediction easier.
Despite P1 having a long-term advantage, the model cannot simply choose to bias P1’s probability up, as that would worsen the current prediction’s loss in expectation. In order for P1 to be preferred against local pressure, the future expected benefit must swamp the local influence.
Under what circumstances can that occur? In offline training, the model isn’t consuming its own outputs. It’s being calibrated against an external ground truth. Biasing up P1 isn’t helpful during training, because making the future easier to predict in an autoregressive context doesn’t reduce loss at training time. Matching the distribution of the input does. In this context, a bias in prediction is almost certainly from undertraining or a lack of capability.
(The other options would tend to be esoteric stuff like “the model is extremely strong to the point of simulated agents managing a reflective gradient hacking attempt of some sort,” which doesn’t seem to be an obvious/natural/necessary outcome for all implementations.)
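A toy arithmetic check of the local-loss point (the 50/50 numbers are made up for illustration): if P0 and P1 really are equally likely in the data, any bias toward P1 strictly increases the expected cross-entropy at the current position.

```python
# Toy check that cross-entropy (the LM training loss) is minimized by matching
# the data distribution: if P0 and P1 are genuinely 50/50 in the data, biasing
# the model toward P1 strictly increases expected loss at this position.
import numpy as np

true_p = np.array([0.5, 0.5])  # data distribution over {P0, P1}

def expected_loss(model_p):
    return -(true_p * np.log(model_p)).sum()

print(expected_loss(np.array([0.5, 0.5])))  # ~0.693 (matches the data)
print(expected_loss(np.array([0.4, 0.6])))  # ~0.713 (biased toward P1: worse)
```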
In other words, the manner in which GPT-like architectures lack myopia is that the model can learn to predict the future beyond the current token in the service of predicting the current token more accurately.
There’s probably some existing terminology for this split that I don’t know about, but I might call it something like myopic perception versus myopic action. GPT-likes do not have myopic perception, but they do have myopic action, and myopic action is the important one.
I agree with the myopic action vs. perception (thinking?) distinction, and that LMs have myopic action.
I don’t think it has to be in service of predicting the current token. It sometimes gives lower loss to make a halfhearted effort at predicting the current token, so that the model can spend more of its weights and compute on preparing for later tokens. The allocation of mental effort isn’t myopic.
As an example, induction heads make use of previous-token heads. The previous-token head isn’t actually that useful for predicting the output at the current position; it mostly exists to prepare some handy activations so that an induction head can look back from a later position and grab them.
So LMs won’t deliberately give bad predictions for the current token if they know a better prediction, but they aren’t putting all of their effort into finding that better prediction.
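A cartoon of that division of labour, using the induction-head example above (toy code over token ids, not a claim about any real model’s weights): the ‘previous-token’ step computes something at early positions that doesn’t help predict the tokens at those positions, but lets a later position find and copy the continuation of a repeated pattern.

```python
# Toy induction-head pattern over token ids (a cartoon, not real model weights):
# step 1 ("previous-token head") annotates each position with the token before it;
# step 2 ("induction head") finds an earlier position whose *previous* token
# matches the current token, and predicts the token that followed it last time.
import numpy as np

tokens = np.array([5, 7, 9, 2, 5, 7])  # "... 5 7 9 2 5 7" -> should predict 9

# Step 1: information computed at early positions that doesn't help predict
# the tokens *at* those positions, only later ones.
prev_token = np.concatenate([[-1], tokens[:-1]])

# Step 2: at the final position, attend back to earlier positions whose
# previous token equals the current token, and copy what came next.
current = tokens[-1]
matches = np.where(prev_token[:-1] == current)[0]  # earlier occurrences of the pattern
prediction = tokens[matches[-1]] if len(matches) else None
print(prediction)  # 9
```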
That’s an important nuance my description left out, thanks. Anything the gradients can reach can be bent to what those gradients serve, so a local token stream’s transformation efforts can indeed be computationally split, even if the output should remain unbiased in expectation.
Thanks for posting this! I agree that it’s good to get it out anyways; I thought it was valuable. I especially resonate with the point in the Pure simulators section.
Some responses:
I think that the main value of the simulators framing was to push back against confused claims that treat (base) GPT3 and other generative models as traditional rational agents. That being said, I do think there are some reasons why the simulator framework adds value relative to “the model is doing next token prediction”:
The simulator framework incorporates specific facts about the token prediction task. We train generative models on tokens from a variety of agents, as opposed to a single unitary agent a la traditional behavior cloning. Therefore, we should expect different behaviors when the context implies that different agents are “natural”. In other words, which “agent” the model acts like is a function of the context, rather than the model having a single fixed policy.
The simulator framework pushes back against “stochastic parrot” claims. In academia or on ML twitter (or, even more so, academic ML twitter), you often encounter claims that language models are “just” stochastic parrots, i.e. they don’t have “understanding” or “grounding”. My guess is this comes from experience with earlier generations of language models, especially early n-gram/small HMM models that really do lack understanding or grounding. (This is less of a thing that happens on LW/AF.) The simulator framework provides a mechanistic model for how a sophisticated language model that does well on the next-token prediction task could end up developing a complicated world model and agentic behavior.
My guess is you have a significantly more sophisticated, empirical model of LMs, such that the simulators framework feels like a simplification to your empirical knowledge + “the model is doing next token prediction”. But I think the simulator framework is valuable because it incorporates additional knowledge about the LM task while pushing back against two significantly more confused framings. (Indeed, Janus makes these claims explicitly in the simulators post!)
Are you thinking of A naive alignment strategy and optimism about generalization?
(Paul does talk about intended vs unintended generalization in a bunch of posts, so it’s conceivable you’re thinking about something more specific.)
I do think people think variants of this; see the comments of Steering Behaviour: Testing for (Non-)Myopia in Language Models, for example.
I’m pretty surprised to hear that anyone made such claims in the first place. Do you have examples of this?
I think this mainly comes up in person with people who’ve just read the intro AI Safety materials, but one example on LW is What exactly is GPT-3's base objective?
I think this is referring to something I said, so I’m going to clarify my stance here.
First, on reading this section now, I’m pretty sure I misunderstood what you were pointing at then. Instead of the ‘offscreen text hallucination’ you describe, where the model implies that some chunk of text appears in the prompt when it doesn’t, I was picturing something like the other hallucinations you mention, specifically hallucinating details about something without implying that they appear in the prompt.
(In retrospect this seems like a pretty uncharitable take on something anyone with a lot of experience with language models would recognize as a problem. My guess is that at the time I was spending too much time thinking about how what you were saying mapped onto my existing ontology and what I would have expected to happen, and not enough on actually making sure I understood what you were pointing at.)
Second, I’m not fully convinced that this is qualitatively different from other types of hallucinations, except that it’s plausibly easier to fix because RLHF can do weird things specifically to prompt interactions (then again, I’m not sure whether you’re actually claiming it’s qualitatively different either, in which case this is just a thought dump). If you prompted GPT with an article on coffee and ended it with a question about what the article says about Hogwarts, the conditional you want is one where someone wrote an article about coffee and where someone else’s immediate follow-up is to ask what it says about Hogwarts.
But this is outweighed on the model’s prior because it’s not something super likely to happen in our world. In other words, the conditional of “the prompt is exactly right and contains the entire content to be used for answering the question” isn’t likely enough relative to other potential conditionals like “the prompt contains the title of the blog post, and the rest of the post was left out” (for the example in the post) or “the context changed suddenly and the question should be answered from the prior” or “questions about the post can be answered using knowledge from outside the post as well” or something else that’s weird because the intended conditional is unlikely enough to allow for it (for the Hogwarts example).
Put that way, this just sounds like it’s quantitatively different from other hallucinations, in that information in the prompt is a stronger way to influence the posterior you get from conditioning. And this can allow us a greater degree of control, but I don’t see the model as doing fundamentally different things here as opposed to other cases.
I’m not entirely sure this is what they believe, but I think the reason this framing gets thrown around a lot is that it’s a pretty evocative way to reason about the model’s behaviour. Specifically, I would be pretty surprised if anyone thought this was literally true in the sense of modelling very low-level features of reality, rather than just using it as a useful way to talk about GPT’s behaviour (like time evolution over some learned underlying dynamics) and to draw inspiration from the analogy.
I agree with this. But while again I’m not entirely sure what Janus would say, I think their interactions with GPT involve a fair degree of human input on long simulations, either in terms of where to prune / focus, or explicit changes to the prompt. (There are some desirable properties we get from a relaxed degree of influence, like story “threads” created much earlier ending up resolving themselves in very unexpected ways much later in the generation stream by GPT, as if that was always the intention.)
Echoing porby’s comment, I don’t find the kind of narrow myopia this breaks to be very concerning.
I agree with this being a problem, but I didn’t get the same impression from the simulators post (albeit I’d heard of the ideas earlier, so the mismatch may be on the post). My takeaway was just that there’s a large conceptual gulf between what we ascribe to the model and to its simulacra, not that there’s a gulf in model space between pure generative models and non-simulators (I actually talk about this problem in an older post, which they were a large influence on).
Regarding the section on hallucinations: I am confused why the example prompt is considered a hallucination. It would, in fact, have fooled me. If I were given that input, I would assume that I was supposed to invent what the blog post contained, since the input only contains what looks like a title. It seems entirely reasonable that the AI would do the same, without some sort of qualifier like “The following is the entire text of a blog post about large language models.”
Yeah! That’s related to what Beth says in a later paragraph, about the model having seen documents with missing text (e.g. text that was in an embedded image, or stripped out by the data processing).
And I think it’s a reasonable task for the model to do. I also think what you said is an uncontroversial, relatively standard explanation for why the model exhibits this behavior.
In modern LM parlance, a “hallucination” doesn’t need to be something humans would get right, nor something that is unreasonable for the AI to get wrong. The specific reason this is considered a hallucination is that people often want to use LMs for text-based question answering or summarization, and making up content is pretty undesirable for that kind of task.
Thanks for clarifying!
So, in that case:
What exactly is a hallucination?
Are hallucinations sometimes desirable?
I don’t think there’s an agreed upon definition of hallucination, but if I had to come up with one, it’s “making inferences that aren’t supported by the prompt, when the prompt doesn’t ask for it”.
The reason why the boundary around “hallucination” is fuzzy is because language models constantly have to make inferences that aren’t “in the text” from a human perspective, a bunch of which are desirable. E.g. the language model should know facts about the world, or be able to tell realistic stories when prompted.