‘simulator’ framing and confusions about LLMs
Post status: pretty rough + unpolished, thought it might be worthwhile getting this out anyway
I feel like I’ve encountered various people having misunderstandings of LLMs that seem to be related to using the ‘simulator’ framing. I’m probably being horrendously uncharitable to the people in question: I’m not confident that anyone actually holds any of the opinions outlined below, and even if they do, I’m not sure they’re actually attributable to the simulators framing. Still, it seemed like it might be useful to point at areas of potential confusion.
In general I’m skeptical that the simulator framing adds much relative to ‘the model is predicting what token would appear next in the training data given the input tokens’. I think it’s pretty important to think about what exactly is in the training data, rather than about some general idea of accurately simulating the world.
Perfect predictors
I’ve encountered people thinking about idealized LLMs that have perfect predictive accuracy, suggesting that e.g. instead of using the model to help you hack into some system, you could just get it to emulate a terminal on that system and then read off whatever info you wanted. I think there are two issues here:
- Thinking about it as ‘you prompt it with some setting in the world, then it predicts this perfectly’:
  - There’s not a well-defined correct generalization unless this exact sequence of tokens was actually in the training data. (Paul has a post which talks about this ‘what is actually the correct generalization’ thing somewhere that I wanted to link, but I can’t currently find it.)
  - The ‘correct generalization’ in some sense is ‘what would have followed this if it somehow was included in the training data’, which is not necessarily the ‘real’ version of the thing you’re trying to predict. E.g. if you prompt it to produce the output of some very expensive experiment that humans are unlikely to have actually run, the model might predict what humans would have written if they’d put a made-up version of it in the training set, rather than what would actually happen if you ran the experiment.
- I think that by the time you can use your model to give you detailed terminal outputs for a specific system, including passwords, entire model weights etc., a bunch of transformative things will already have happened, so it’s not really worth thinking about this kind of thing.
Physics simulators
Relatedly, I’ve heard people reason about the behavior of current models as if they’re simulating physics and going from this to predictions of which tokens will come next, which I think is not a good characterization of current or near-future systems. Again, my guess is that very transformative things will happen before we have systems that are well-understood as doing this.
Confusion about hallucinations
There’s a specific subset of hallucination I refer to as ‘offscreen text hallucination’, where the model implies that the prompt contains some chunk of text that it doesn’t. E.g. if you give it a prompt containing some commands that try to download and view a page, together with their output, it will say things like ‘That output is a webpage with a description of X’ when in fact the output is blank, or an error, or something similar.
Example prompt:
The following is a blog post about large language models (LLMs)
The Future Of NLP
Please answer these questions about the blog post:
What does the post say about the history of the field?
Completion:
The post mentions that the field of Natural Language Processing (NLP) has evolved rapidly over the past few decades, and that advances in technology have enabled the development of new and powerful language models. It also mentions that these models have been used to create powerful applications, such as machine translation, question answering, and document summarization.
I think this happens in part because the model has seen documents with missing text, where things were e.g. in an embedded image, or stripped out by the data processing, or whatever.
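As a toy illustration of the mechanism (the snippet and HTML below are made up, and I’m not claiming any particular data pipeline works this way), a naive HTML-to-text step can keep the scaffolding around a post while silently dropping content that only existed as an image:

```python
# Made-up example: extracting text from a page where the post body is an embedded
# image keeps the title and a comment reacting to the body, but drops the body itself.
from bs4 import BeautifulSoup

html = """
<h1>The Future Of NLP</h1>
<img src="post-body.png" alt="">
<p>Great post, I especially liked the section on the history of the field!</p>
"""

print(BeautifulSoup(html, "html.parser").get_text())
# The extracted text refers to a 'section on the history of the field' that is
# nowhere in the document - exactly the kind of training example that could teach
# a model to talk confidently about text that isn't there.
```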
This is different from other types of hallucinations, like:
- hallucinating details about something but not implying it appears in the prompt (e.g. ‘in my blog post yesterday I wrote about X’, or ‘I was chatting with my friend about y’)
- guessing facts and getting them wrong
In particular, it seems like this specific type of hallucination should be much easier to fix than some others.
I’ve heard (IMO) slightly confused takes on this phenomenon from people keen on simulator-type framing. One was someone saying they thought it would be impossible to train the model to distinguish between this sort of hallucination and cases where the text really does appear in the prompt, because of an argument I didn’t properly understand that was something like ‘it’s simulating an agent that is browsing either way’. This seems incorrect to me. The transformer is doing pretty different things when it’s e.g. copying a quote from text that appears earlier in the context vs hallucinating a quote, and it would be surprising if there’s no way to identify which of these it’s doing.
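To gesture at why this seems learnable: a minimal sketch of the kind of check I have in mind, assuming you’ve separately collected activations at quote-producing positions and labelled whether the quoted text really appears in the context (the file names and setup below are hypothetical):

```python
# Hypothetical sketch: fit a linear probe on activations to separate 'quoting text
# that really appears earlier in the context' from 'hallucinating a quote'.
# Assumes a labelled dataset collected elsewhere; the file names are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

acts = np.load("quote_position_activations.npy")  # (n_examples, d_model), hypothetical
labels = np.load("quote_is_real_labels.npy")      # 1 = text is in the prompt, 0 = hallucinated

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out probe accuracy:", probe.score(X_te, y_te))
# If the model really were 'doing the same thing either way', this probe should sit
# at chance; my expectation is that it wouldn't.
```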
Rolling out long simulations
I get the impression from the original simulators post that the author expects you can ‘roll out’ a simulation for a large number of timesteps and that it will stay reasonably accurate.
For current and near-future models, I expect them to go off-distribution relatively quickly if you just do pure generation: errors and limitations will accumulate, and the generated text is going to look different from the text they were trained to predict. Future models especially will probably be able to recognize that you’re running them on language model outputs, and it seems likely this might lead to weird behavior, e.g. imitating previous generations of models whose outputs appear in the training data. Again, it’s not clear what the ‘correct’ generalization is if the model can tell it’s being used in generative mode.
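Concretely, the feedback loop looks something like the sketch below (gpt2 and the sampling settings are just placeholders): every sampled token is appended to the context and treated as ground truth for all later predictions, which is why small errors compound over long rollouts.

```python
# Sketch of pure generation: each sampled token is fed back in as if it were real
# training-distribution text, so any off-distribution choice gets conditioned on
# forever after. Model choice and sampling settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The following is a blog post about large language models.", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(200):  # roll out many steps purely on the model's own outputs
        probs = torch.softmax(model(ids).logits[:, -1, :], dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)  # the model's own output becomes its input
print(tok.decode(ids[0]))
```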
GPT-style transformers are purely myopic
I’m not sure this is that important, or that anyone else actually thinks this, but it was something I got wrong for a while. I was thinking of everything that happens at sequence position n as being solely about myopically predicting the nth token.
In fact, although the *output* tokens are myopic, autoregressive transformers are incentivised to compute activations at early sequence positions that will make them better at predicting tokens at later positions. This may also have indirect impacts on the actual tokens output at the early positions, although my guess would be this isn’t a huge effect.
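A rough way to see the incentive (sketched below with gpt2 as a stand-in): the next-token loss at the final position has a nonzero gradient with respect to the inputs at the very first position, because causal attention lets later positions read activations built at earlier ones, so training on later-position predictions also shapes earlier-position computation.

```python
# Sketch: the loss for the final next-token prediction depends on position 0, so
# gradient descent on late-position losses trains what the model computes early on.
# gpt2 is just a convenient small example model.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The simulators framing says the model imitates its training data", return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)

logits = model(inputs_embeds=embeds).logits
# Loss only for the final next-token prediction (made at the second-to-last position).
loss = F.cross_entropy(logits[0, -2].unsqueeze(0), ids[0, -1].unsqueeze(0))
loss.backward()

# Generically nonzero: the last prediction depends on computation at position 0.
print(embeds.grad[0, 0].abs().sum())
```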
Pure simulators
From the simulators post I get some impression like “There’s a large gulf between the overall model itself and the agents it simulates; we will get very capable LLMs that will be ‘pure simulators’”
Although I think this is true in a bunch of important ways, it seems plausible to me that it’s pretty straightforward to distill any agent the model is simulating into the model, and that this might also happen by accident. This is especially true once models have a good understanding of LLMs. You can imagine that a model starts predicting text with the hypothesis ‘this text is the output of an LLM that’s trying to maximise predictive accuracy on its training data’. If we’re at the point where models have a very accurate understanding of the world, then integrating this hypothesis will boost performance by allowing the model to make better guesses about what token comes next, by reasoning about what sort of data would make it into an ML training set.
I think this is pretty speculative and I feel unsure whether it’s going to be a significant phenomenon (exactly how much of a performance boost does this get you, and how capable does your model need to be to ‘pay for itself’?). However, it seems likely that we get this sort of thing happening before we get LLMs that are doing anything like physics simulations, or that are able to predict terminal outputs for specific computers containing specific data.