I’ve been thinking about this post a lot since it first came out. Overall, I think its core thesis is wrong, and I’ve seen a lot of people make confident wrong inferences on the basis of it.
The core problem with the post was covered by Eliezer’s post “GPTs are Predictors, not Imitators” (which was not written, I think, as a direct response, but which still seems to me to convey the core problem with this post):
Imagine yourself in a box, trying to predict the next word—assign as much probability mass to the next token as possible—for all the text on the Internet.
Koan: Is this a task whose difficulty caps out as human intelligence, or at the intelligence level of the smartest human who wrote any Internet text? What factors make that task easier, or harder? (If you don’t have an answer, maybe take a minute to generate one, or alternatively, try to predict what I’ll say next; if you do have an answer, take a moment to review it inside your mind, or maybe say the words out loud.)
Consider that somewhere on the internet is probably a list of thruples: <product of 2 prime numbers, first prime, second prime>.
GPT obviously isn’t going to predict that successfully for significantly-sized primes, but it illustrates the basic point:
There is no law saying that a predictor only needs to be as intelligent as the generator, in order to predict the generator’s next token.
Indeed, in general, you’ve got to be more intelligent to predict particular X, than to generate realistic X. GPTs are being trained to a much harder task than GANs.
Same spirit: <Hash, plaintext> pairs, which you can’t predict without cracking the hash algorithm, but which you could far more easily generate typical instances of if you were trying to pass a GAN’s discriminator about it (assuming a discriminator that had learned to compute hash functions).
The Simulators post repeatedly alludes to the loss function on which GPTs are trained corresponding to a “simulation objective”, but I don’t really see why that would be true. It is technically true that a GPT that perfectly simulates earth, including the creation of its own training data set, can use that simulation to get perfect training loss. But actually doing so would require enormous amounts of compute and we of course know that nothing close to that is going on inside of GPT-4.
To me, the key feature of a “simulator” would be a process that predicts the output of a system by developing it forwards in time, or some other time-like dimension. The predictions get made by developing an understanding of the transition function of a system between time-steps (the “physics” of the system) and then applying that transition function over and over again until your desired target time.
I would be surprised if this is how GPT works internally in its relationship to the rest of the world and how it makes predictions. The primary interesting thing that seems to me true about GPT-4’s training objective is that it is highly myopic. Beyond that, I don’t see any reason to think of it as particularly more likely to create something that tries to simulate the physics of any underlying system than other loss functions one could choose.
When GPT-4 encounters a hash followed by the pre-image of that hash, or a complicated arithmetic problem, or is asked a difficult factual geography question, it seems very unlikely that the way GPT-4 goes about answering that question is purely rooted in simulating the mind that generated the hash and pre-image, or the question it is being asked. There will probably be some simulation going on, but a lot of what’s going on is just straightforward problem-solving of whatever sub-problems are necessary to predict the next tokens successfully, many of which will not correspond to simulating the details of the process that generated those tokens. (In the case of a hash followed by a pre-image, the humans that generated that tuple of course had access to the pre-image first, then hashed it, and then just reversed the order in which they pasted it into the text, making this task practically impossible to solve if you structure your internal cognition as a simulation of any kind of system.)
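To make the asymmetry concrete, here’s a minimal sketch (standard library only; the example string is made up): writing such a line left-to-right is trivial for the human who already holds the plaintext, but a left-to-right predictor has to invert the hash, and no amount of “simulating the author” helps, because the author never ran the computation in that direction.

```python
import hashlib

def make_training_line(plaintext: str) -> str:
    # The human author starts from the plaintext, hashes it, and then writes
    # the hash *first* -- cheap to generate in this order.
    digest = hashlib.sha256(plaintext.encode()).hexdigest()
    return f"{digest}: {plaintext}"

print(make_training_line("hello world"))

# A next-token predictor reads the line left-to-right, so it sees the digest
# before the plaintext. Putting high probability on the correct continuation
# after "<digest>: " amounts to inverting SHA-256 -- far harder than the
# generator's task, and not something you get by simulating the author.
```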
This post is long, and I might have misunderstood it, and many people I talked to keep referencing this post as something that successfully gets some kind of important intuition across. But when I look at the concrete statements and predictions made by this post, I don’t see how it holds up to scrutiny, though it is still plausible to me that there is some bigger picture being painted that does help people understand some important things better.
I think you missed the point. I agree that language models are predictors rather than imitators, and that they probably don’t work by time-stepping forward a simulation. Maybe Janus should have chosen a word other than “simulators.” But if you gensym out the particular choice of word, this post is encapsulating the most surprising development of the past few years in AI (and therefore, the world).
Chapter 10 of Bostrom’s Superintelligence (2014) is titled, “Oracles, Genies, Sovereigns, Tools”. As the “Inadequate Ontologies” section of this post points out, language models (as they are used and heralded as proto-AGI) aren’t any of those things. (The Claude or ChatGPT “assistant” character is, well, a simulacrum, not “the AI itself”; it’s useful to have the word simulacrum for this.)
This is a big deal! Someone whose story about why we’re all going to die was limited to, “We were right about everything in 2014, but then there was a lot of capabilities progress,” would be willfully ignoring this shocking empirical development (which doesn’t mean we’re not all going to die, but it could be for somewhat different reasons).
repeatedly alludes to the loss function on which GPTs are trained corresponding to a “simulation objective”, but I don’t really see why that would be true [...] particularly more likely to create something that tries to simulate the physics of any underlying system than other loss functions one could choose
Call it a “prediction objective”, then. The thing that makes the prediction objective special is that it lets us copy intelligence from data, which would have sounded nuts in 2014 and probably still does (but shouldn’t).
If you think of gradient descent as an attempted “utility function transfer” (from loss function to trained agent) that doesn’t really work because of inner misalignment, then it may not be clear why it would induce simulator-like properties in the sense described in the post.
But why would you think of SGD that way? That’s not what the textbook says. Gradient descent is function approximation, curve fitting. We have a lot of data (x, y), and a function f(x, ϕ), and we keep adjusting ϕ to decrease −log P(y|f(x, ϕ)): that is, to make y = f(x, ϕ) less wrong. It turns out that fitting a curve to the entire internet is surprisingly useful, because the internet encodes a lot of knowledge about the world and about reasoning.
If you don’t see why “other loss functions one could choose” aren’t as useful for mirroring the knowledge encoded in the internet, it would probably help to be more specific? What other loss functions? How specifically do you want to adjust ϕ, if not to decrease −log P(y|f(x, ϕ))?
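For concreteness, here’s a minimal sketch of the curve-fitting view (the toy model and data are invented; the only point is that the update rule is literally “adjust ϕ to decrease −log P(y|f(x, ϕ))”, with nothing agent-shaped anywhere in it):

```python
import torch

# Toy "internet": (x, y) pairs standing in for (context, next token). All made up.
torch.manual_seed(0)
vocab_size, dim, n = 50, 16, 1000
x = torch.randn(n, dim)
true_W = torch.randn(dim, vocab_size)
y = torch.distributions.Categorical(logits=x @ true_W).sample()

# The "curve" f(x, phi): a linear map from context features to next-token logits.
phi = torch.zeros(dim, vocab_size, requires_grad=True)
opt = torch.optim.SGD([phi], lr=0.5)

for step in range(200):
    logits = x @ phi
    # -log P(y | f(x, phi)): the prediction objective, nothing more.
    loss = torch.nn.functional.cross_entropy(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final -log P(y|x): {loss.item():.3f}")
```

Pretraining is this loop with f a transformer, x a token context, and y the next token — the scale changes, the objective doesn’t.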
Sure, I am fine with calling it a “prediction objective” but if we drop the simulation abstraction then I think most of the sentences in this post don’t make sense. Here are some sentences which only make sense if you are talking about a simulation in the sense of stepping forward through time, and not just something optimized according to a generic “prediction objective”.
> A simulation is the imitation of the operation of a real-world process or system over time.
[...]
It emphasizes the role of the model as a transition rule that evolves processes over time. The power of factored cognition / chain-of-thought reasoning is obvious.
[...]
It’s clear that in order to actually do anything (intelligent, useful, dangerous, etc), the model must act through simulation of something.
[...]
Well, typically, we avoid getting confused by recognizing a distinction between the laws of physics, which apply everywhere at all times, and spatiotemporally constrained things which evolve according to physics, which can have contingent properties such as caring about a goal.
[...]
Below is a table which compares various simulator-like things to the type of simulator that GPT exemplifies on some quantifiable dimensions. The following properties all characterize GPT:
Generates rollouts: The model naturally generates rollouts, i.e. serves as a time evolution operator
[...]
Not only does the supervised/oracle perspective obscure the importance and limitations of prompting, it also obscures one of the most crucial dimensions of GPT: the implicit time dimension. By this I mean the ability to evolve a process through time by recursively applying GPT, that is, generate text of arbitrary length.
[...]
This resulting policy is capable of animating anything that evolves according to that rule: a far larger set than the sampled trajectories included in the training data, just as there are many more possible configurations that evolve according to our laws of physics than instantiated in our particular time and place and Everett branch.
I think these quotes illustrate that the concept of a simulator as invoked in this post is about simulating the process that gave rise to your training distribution, according to some definition of time. But I don’t think this is how GPT works, and I don’t think it helps you make good predictions about what happens. Many of the problems GPT successfully solves are not solvable via this kind of simulation, as far as I can tell.
I don’t think the behavior we see in large language models is well-explained by the loss function being a “prediction objective”. Imagine a prediction objective that is not myopic, but requires creating long chains of internal inference to arrive at, more similar to the length of a full-context completion of GPT. I don’t see how such a prediction objective would give rise to the interesting dynamics that seem true about GPT. My guess is in the pursuit of such a non-myopic prediction objective you would see the development of quite instrumental forms of reasoning and general purpose problem-solving, with substantial divergence from how we currently think of GPTs.
The fact that the training signal is so myopic, on the other hand, and applies at a token-by-token level, seems to explain a huge amount of the variance.
To be clear, I think there is totally interesting content to study in how language models work given the extremely myopic prediction objective that they optimize, that nevertheless gives rise to interesting high-level behavior, and I agree with you that studying that is among the most important things to do at the present time, but I think this post doesn’t offer a satisfying answer to the questions raised by such studies, and indeed seems to make a bunch of wrong predictions.
Imagine a prediction objective that is not myopic, but requires creating long chains of internal inference to arrive at, more similar to the length of a full-context completion of GPT. I don’t see how such a prediction objective would give rise to the interesting dynamics that seem true about GPT. My guess is in the pursuit of such a non-myopic prediction objective you would see the development of quite instrumental forms of reasoning and general purpose problem-solving, with substantial divergence from how we currently think of GPTs.
The pretraining objective isn’t myopic? The parameter updates route across the entire context, backing up from the attention scores of later positions through e.g. the MLP sublayer outputs at position 0.
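For what it’s worth, here’s a toy check of that claim (a hand-rolled single attention head with random weights, not any particular GPT): even when the loss is taken only at the final position, the gradient reaches the activations at position 0, so the parameter updates are not confined to the current position.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, dim = 8, 32

# Stand-in residual-stream activations at each position (hypothetical values).
h = torch.randn(seq_len, dim, requires_grad=True)

# One hand-rolled causal self-attention head, for illustration only.
Wq, Wk, Wv = [torch.randn(dim, dim) * 0.1 for _ in range(3)]
q, k, v = h @ Wq, h @ Wk, h @ Wv

scores = (q @ k.T) / dim ** 0.5
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))
out = F.softmax(scores, dim=-1) @ v

# Take a loss only at the *last* position, as "myopic" as it gets...
loss = out[-1].pow(2).sum()
loss.backward()

# ...and yet the gradient reaches position 0's activations (and hence any
# parameters that produced them), because the last position attends to it.
print(h.grad[0].abs().sum())  # nonzero
```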
the extremely myopic prediction objective that they optimize
As a smaller note, language models do not optimize the predictive objective, so much as the loss function optimizes the language model. I think the wording you chose is going to cause confusion and lead to incorrect beliefs.
The pretraining objective isn’t myopic? The parameter updates route across the entire context, backing up from the attention scores of later positions through e.g. the MLP sublayer outputs at position 0.
This is something I’ve been thinking a lot about, but still don’t feel super robust in. I currently think it makes sense to describe the pretraining objective as myopic in the relevant way, but am really not confident. I agree that the training objective isn’t as myopic as I implied here, though I also don’t think the training objective is well-summarized as jointly optimizing the whole context-length response.
I have a dialogue I’ll probably publish soon about this, and would be interested in your comments on it when it goes live. Probably doesn’t make sense to go in-depth about this before that’s published, since it captures my current confusions and thoughts probably better than what I would write anew in a comment thread like this.
The Simulators post repeatedly alludes to the loss function on which GPTs are trained corresponding to a “simulation objective”, but I don’t really see why that would be true. It is technically true that a GPT that perfectly simulates earth, including the creation of its own training data set, can use that simulation to get perfect training loss. But actually doing so would require enormous amounts of compute and we of course know that nothing close to that is going on inside of GPT-4.
I think a lot of what is causing confusion here is the word ‘simulation’. People often talk colloquially about “running a weather simulation” or “simulating an aircraft’s wing under stress”. This is a common misnomer; technically the correct word they should be using there is ‘emulation’. If you are running a detailed analysis of each subprocess that matters and combining all their interactions together to produce a detailed prediction, then you are ‘emulating’ something. On the other hand, if you’re doing something that more resembles a machine learning model pragmatically learning its behavior (what one could even call a stochastic parrot), trained to predict the same outcomes over some large set of sample situations, then you’re running a ‘simulation’.
As janus writes:
Self-supervised ML can create “behavioral” simulations of impressive semantic fidelity. Whole brain emulation is not necessary to construct convincing and useful virtual humans; it is conceivable that observations of human behavioral traces (e.g. text) are sufficient to reconstruct functionally human-level virtual intelligence.
So he is clearly and explicitly making this distinction between the words ‘simulation’ and ‘emulation’, and evidently understands the correct usage of each of them. To pick a specific example, the weather models that most governments’ meteorological departments run are emulations that divide the entire atmosphere (or the part near that country) into a great many small cells and emulate the entire system (except at the level of the smallest cells, where they fall back on simulation since they cannot afford to further subdivide the problem, as the physics of turbulence would otherwise require); whereas the (vastly more computationally efficient) GraphCast system that DeepMind recently built is a simulation. It basically relies on the weather continuing to act in the future in ways it has in the past (so potentially could be thrown off by effects like global warming). So Simulator Theory is saying “LLMs work the way GraphCast makes weather predictions”, not “LLMs work the way detailed models of the atmosphere split into a vast number of tiny cells make weather predictions”.
[The fact that this is even possible in non-linear systems is somewhat surprising, as janus is expressing in the quote above, but then Science has often managed to find useful regularities in the behavior of very large systems, ones that do not require mechanistically breaking their behavior down all the way to individual fundamental particles to model them. Most behavior most of the time is not in fact NP-complete, and has Lyapunov times much longer than the periods between interactions of its constituent fundamental particles — so clearly a lot of the fine details often wash out. Apparently this is also true of the human brain, unlike the case for computers.]
So the “Simulator Theory” is not an “Emulator Theory”. Janus is explicitly not claiming that an LLM “perfectly [emulates] earth, including the creation of its own training data set”. Any fan of Simulator Theory who makes claims like that has not correctly understood it (most likely due to this common confusion over the meaning of the word ‘simulate’). The claim in the Simulation Thesis is that the ML model finds and learns regularities in its training set, and then reapplies them in a way that makes (quite good) predictions, without doing a detailed emulation of the process it is predicting, in just the same way that GraphCast makes weather predictions without (and far more computationally cheaply than) emulating the entire atmosphere. (Note that this claim is entirely uncontroversial: that’s exactly what machine learning models always do when they work.) So the LLM has internal world models, but they are models of the behavior of parts of the world, not of the detailed underlying physical process that produces that behavior. Also note that while such models can sometimes correctly extrapolate outside the training distribution, this requires luck: specifically that no new phenomena become important to the behavior outside the training distribution that weren’t learnable from the behavior inside it. The risk of this being false increases the more complex the underlying system is and the further you attempt to extrapolate outside the training distribution.
I would actually be curious about having a dialogue with anyone who disagrees with the review above. It seems like this post had a large effect on people, and I would like there to be a proper review of it, so having two people have a debate about its merits seems like a decent format to me.
I can at least give you the short version of why I think you’re wrong, if you want to chat lmk I guess.
Plain text: “GPT is a simulator.”
Correct interpretation: “Sampling from GPT to generate text is a simulation, where the state of the simulation’s ‘world’ is the text and GPT encodes learned transition dynamics between states of the text.”
Mistaken interpretation: “GPT works by doing a simulation of the process that generated the training data. To make predictions, it internally represents the physical state of the Earth, and predicts the next token by applying learned transition dynamics to the represented state of the Earth to get a future state of the Earth.”
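To make the correct interpretation concrete, here’s a deliberately tiny stand-in (a hand-written transition table instead of a neural net; all tokens and probabilities are made up): the “world state” is just the text so far, and the model supplies the transition rule that maps state to next state.

```python
import random

# A toy stand-in for GPT: a hand-written table of next-token probabilities.
# (A real model conditions on the whole state, not just the last token, and
#  its "table" is implicit in the weights; this is only about the loop shape.)
transition = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.2, "ran": 0.8},
    "sat": {"the": 1.0},
    "ran": {"the": 1.0},
}

def step(state):
    """One time-step of the simulation: the world state is the text so far,
    and the model supplies the (stochastic) transition dynamics."""
    dist = transition[state[-1]]
    next_token = random.choices(list(dist), weights=list(dist.values()))[0]
    return state + [next_token]

state = ["the"]           # the prompt is the initial condition
for _ in range(10):       # rolling out = running the simulation forward in "text time"
    state = step(state)
print(" ".join(state))
```

The whole content of the claim is the shape of that loop — state in, extended state out, repeated — not anything about what happens inside the transition rule.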
-
So that’s the “core thesis.” Maybe it would help to do the same thing for some of the things you might use the simulator framing for?
Plain text: “GPT can simulate a lot of different humans.”
Correct interpretation: “The text dynamics of GPT can support long-lived dynamical processes that write text like a lot of different humans. This is a lot like how a simulation of the solar system could have a lot of different orbits depending on the initial condition, except the laws of text are a lot more complicated and anthropocentric than the laws of celestial mechanics.”
Mistaken interpretation: “When GPT is talking like a person, that’s because there is a sentient simulation of a person in there doing thinking that is then translated into words.”
Plain text: “Asking whether GPT knows some fact is the wrong question. It’s specific simulacra that know things.”
Correct interpretation: “The dynamical processes that get you human-like text out of GPT (‘simulacra’) can vary in how easy it is to get them to recite a desired fact. You might hope there’s some ‘neutral’ way to get a recitation of a fact out of GPT, but there is no such neutral way, it’s all dynamical processes. When it comes to knowing things, GPT is more like a compression algorithm than a person. It knows a fact well when that fact is the result of simple initial conditions.”
Drawback of the correct interpretation: Focuses imagination on text processes that play human roles, potentially obscuring more general ways to get to desired output text.
Mistaken interpretation: “Inside GPT’s simulation of the physical state of the Earth, it tracks what different people know.”
Plain text: “If you try to get GPT to solve hard problems, and it succeeds, it might be simulating a non-human intelligence.”
Correct interpretation: “GPT has learned text dynamics that include a lot of clever rules for getting correct answers, because it’s had to predict a lot of text that requires cleverness. A lot of those clever rules were learned to predict human text, and are interwoven with other heuristics that keep its state in the distribution of human text. But if it’s being clever in ways that humans aren’t, it’s probably going to leave the distribution of human text in other ways.”
Mistaken? interpretation: “If GPT starts getting good at reversing hashes, it’s about to break out of its server and start turning the Earth into well-predicted tokens.”
Sure, I wasn’t under the impression that the claim was that GPT was literally simulating earth, but I don’t understand how describing something as a simulation of this type, over a completely abstract “next token space”, constrains expectations.
Like, I feel like you can practically define all even slightly recurrent systems as “simulators” of this type. If we aren’t talking about simulating something close to human minds, what predictions can we make?
Like, let’s say I have a very classical RL algorithm, something like AlphaZero with MCTS. It also “simulates” a game state by state into the future (into many different branches). But how does this help me predict what the system does? AlphaZero seems to share few of the relevant dynamics this post is talking about.
This is what all that talk about predictive loss was for. Training on predictive loss gets you systems that are especially well-suited to being described as learning the time-evolution dynamics of the training distribution. Not in the sense that they’re simulating the physical reality underlying the training distribution, merely in the sense that they’re learning dynamics for the behavior of the training data.
Sure, you could talk about AlphaZero in terms of prediction. But it’s not going to have the sort of configurability that makes the simulator framing so fruitful in the case of GPT (or in the case of computer simulations of the physical world). You can’t feed AlphaZero the first 20 moves of a game by Magnus Carlsen and have it continue like him.
Or to use a different example, one time when talking about simulators helps is when someone asks “Does GPT know this fact?”, because GPT’s dynamics are inhomogeneous—it doesn’t always act with the same quality of knowing the fact or not knowing it. But AlphaZero’s training process is actively trying to get rid of that kind of inhomogeneity—AlphaZero isn’t trained to mimic a training distribution, it’s trained to play high-scoring moves.
The simulator framing has no accuracy advantage over thinking directly in terms of next token prediction, except that thinking in terms of simulator and simulacra sometimes usefully compresses the relevant ideas, and so lets people think larger new thoughts at once. Probably useful for coming up with ChatGPT jailbreaks. Definitely useful for coming up with prompts for base GPT.
To add to Charlie’s point (which seems right to me):
As I understand things, I think we are talking about a simulation of something somewhat close to human minds—e.g. text behaviour of humanlike simulacra (made of tokens—but humans are made of atoms). There’s just no claim of an internal simulation.
I’d guess a common upside is to avoid constraining expectations unhelpfully in ways that [GPT as agent] might.
However, I do still worry about saying “GPT is a simulator” rather than something like “GPT currently produces simulations”. I think the former suggests too strongly that we understand something about what it’s doing internally—e.g. at least that it’s not inner misaligned, and won’t stop acting like a simulator at some future time (and can easily be taken to mean that it’s doing simulation internally).
If the aim is to get people thinking more clearly, I’d want it to be clearer that this is a characterization of [what GPTs currently output], not [what GPTs fundamentally are].
As I understand things, I think we are talking about a simulation of something somewhat close to human minds—e.g. text behaviour of humanlike simulacra (made of tokens—but humans are made of atoms). There’s just no claim of an internal simulation.
I mean, that is the exact thing that I was arguing against in my review.
I think the distribution of human text just has too many features that are hard to produce via simulating human-like minds. I agree that the system is trained on imitating human text, and that necessarily requires being able to roleplay as many different humans, but I don’t think the process of that roleplay is particularly likely to be akin to a simulation (similarly to how when humans roleplay as other humans they do a lot of cognition that isn’t simulation, e.g. when an actor plays a character in a movie they do things like explicitly think about the historical period in which the story is set, recognize that certain scenes will be hard to pull off, solve a problem using the knowledge they have when not roleplaying and then retrofit their solution into something the character might have come up with, etc. When humans imitate things we are not limited to simulating the target of our imitation).
The cognitive landscape of an LLM is also very different from a human’s, and it seems clear that in many contexts the behavior of an LLM will generalize quite differently than it would for a human, and simulation again seems unlikely to be the only, or honestly even the primary, way I expect an LLM to get good at human text imitation given that differing cognitive landscape.
Oh, hang on—are you thinking that Janus is claiming that GPT works by learning some approximation to physics, rather than ‘physics’?
IIUC, the physics being referred to is either through analogy (when it refers to real-world physics), or as a generalized ‘physics’ of [stepwise addition of tokens]. There’s no presumption of a simulation of physics (at any granularity).
E.g.:
Models trained with the strict simulation objective are directly incentivized to reverse-engineer the (semantic) physics of the training distribution, and consequently, to propagate simulations whose dynamical evolution is indistinguishable from that of training samples.
Apologies if I’m the one who’s confused :). This just seemed like a natural explanation for your seeming to think the post is claiming a lot more mechanistically. (I think it’s claiming almost nothing)
No, I didn’t mean to imply that. I understand that “physics” here is a general term for understanding how any system develops forward according to some abstract definition of time.
What I am saying is that even with a more expansive definition of physics, it seems unlikely to me that GPT internally simulates a human mind (or anything else really) in a way where structurally there is a strong similarity between the way a human brain steps forward in physical time, and the way the insides of the transformer generates additional tokens.
Sure, but I don’t think anyone is claiming that there’s a similarity between a brain stepping forward in physical time and transformer internals. (perhaps my wording was clumsy earlier)
IIUC, the single timestep in the ‘physics’ of the post is the generation and addition of one new token. I.e. GPT uses [some internal process] to generate a token. Adding the new token is a single atomic update to the “world state” of the simulation. The [some internal process] defines GPT’s “laws of physics”.
The post isn’t claiming that GPT is doing some generalized physics internally. It’s saying that [GPT(input_states) --> (output_states)] can be seen as defining the physical laws by which a simulation evolves.
As I understand it, it’s making almost no claim about internal mechanism.
Though I think “GPT is a simulator” is only intended to apply if its simulator-like behaviour robustly generalizes—i.e. if it’s always producing output according to the “laws of physics” of the training distribution (this is imprecise, at least in my head—I’m unclear whether Janus have any more precise criterion).
I don’t think the post is making substantive claims that disagree with [your model as I understand it]. It’s only saying: here’s a useful way to think about the behaviour of GPT.
An LLM is a simulation, a system statistically trained to try to predict the same distribution of outputs as a human writing process (which could be a single brain in near-real-time, or an entire Wikipedia community of them interacting over years). It is not a detailed physical emulation of either of these processes.
The simple fact that a human brain has O(10^14) synapses and current LLMs only have up to O(10^12) parameters makes it clear that it’s going to be a fairly rough simulation — I actually find it pretty astonishing that we often get as good a simulation as we do out of a system that clearly has orders of magnitude less computational capacity. Apparently a lot of aspects of human text generation aren’t so complex as to actually engage and require a large fraction of the entire computational capacity of the brain to get even a passable approximation to the output. Indeed, the LLM scaling laws give us a strong sense of how much, at an individual token-guessing level, the predictability of human text improves as you throw more computational capacity and a larger training sample set at the problem, and the answer is a power law with sharply diminishing returns: each doubling of compute and data shrinks the remaining (reducible) loss by a roughly constant factor, rather than by a fixed amount.
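For reference, the parametric form usually fit in the scaling-law literature (e.g. Hoffmann et al. 2022, the “Chinchilla” paper) makes that diminishing-returns shape explicit. With N the parameter count, D the number of training tokens, and E, A, B, α, β fitted constants:

L(N, D) ≈ E + A·N^(−α) + B·D^(−β)

So each doubling of N or D multiplies its excess-loss term by a constant factor (2^(−α) or 2^(−β)), while the irreducible term E is the part no amount of scaling removes.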
I don’t disagree, but I don’t think that describing the process an LLM uses to generate a single token as a simulation is clarifying in this context.
I’m fairly sure the post is making no such claim, and I think it becomes a lot more likely that readers will have habryka’s interpretation if the word “simulation” is applied to LLM internals (and correctly conclude that this interpretation entails implausible claims). I think “predictor” or the like is much better here.
Unless I’m badly misunderstanding, the post is taking a time-evolution-of-a-system view of the string of tokens—not of LLM internals. I don’t think it’s claiming anything about what the internal LLM mechanism looks like.
I think janus is explicitly using the verb ‘simulate’ as opposed to ‘emulate’ because he is not making any claims about LLM internals (and indeed doesn’t think the internals, whatever they may be, include a detailed emulation), and I think that this careful distinction in terminology (which janus explicitly employs at one point in the post above, when discussing just this question, so is clearly familiar with) is sadly lost on many readers, who tend to assume that the two words mean the same thing since the word ‘simulate’ is commonly misused to include ‘emulate’ — a mistake I’ve often made myself.
I agree that the word ‘predict’ would be less liable to this particular misunderstanding, but I think it has some other downsides: you’d have to ask janus why he didn’t pick it.
So my claim is, if someone doesn’t understand why it’s called “Simulator Theory” as opposed to “Emulator Theory”, then they haven’t correctly understood janus’ post. (And I have certainly seen examples of people who appear to think LLMs actually are emulators, of nearly unlimited power. For example, the ones who suggested just asking an LLM for the text of the most cited paper on AI Alignment from 2030, something that predicting correctly would require emulating a significant proportion of the world for about six years.)
The point I’m making here is that in the terms of this post the LLM defines the transition function of a simulation.
I.e. the LLM acts on [string of tokens], to produce [extended string of tokens]. The simulation is the entire thing: the string of tokens changing over time according to the action of the LLM.
Saying “the LLM is a simulation” strongly suggests that a simulation process (i.e. “the imitation of the operation of a real-world process or system over time”) is occurring within the LLM internals.
Saying “GPT is a simulator” isn’t too bad—it’s like saying “The laws of physics are a simulator”. Loosely correct. Saying “GPT is a simulation” is like saying “The laws of physics are a simulation”, which is at least misleading—I’d say wrong.
In another context it might not be too bad. In this post simulation has been specifically described as “the imitation of the operation of a real-world process or system over time”. There’s no basis to think that the LLM is doing this internally.
Unless we’re claiming that it’s doing something like that internally, we can reasonably say “The LLM produces a simulation”, but not “The LLM is a simulation”.
(oh and FYI, Janus is “they”—in the sense of actually being two people: Kyle and Laria)
The point I’m making here is that in the terms of this post the LLM defines the transition function of a simulation.
I guess (as an ex-physicist and long-time software engineer) I’m not really hung up about the fact that emulations are normally performed one timestep at a time, and simulations certainly can be, so didn’t see much need to make a linguistic distinction for it. But that’s fine, I don’t disagree. Yes, an emulation or (in applicable cases) simulation process will consist of a sequence of many timesteps, and an LLM predicting text similarly does so one token at a time sequentially (which may not, in fact, be the order that humans produced them, or consume them, though by default usually is — something that LLMs often have trouble with, presumably due to their fixed forward-pass computational capacity).
(oh and FYI, Janus is “they”—in the sense of actually being two people: Kyle and Laria)
Suddenly their username makes sense! Thanks, duly noted.
Perhaps we’re talking past each other to a degree. I don’t disagree with what you’re saying. I think I’ve been unclear—or perhaps just saying almost vacuous things. I’m attempting to make a very weak claim (I think the post is also making no strong claim—not about internal mechanism, at least).
I only mean that the output can often be efficiently understood in terms of human characters (among other things). I.e. that the output is a simulation, and that human-like minds will be an efficient abstraction for us to use when thinking about such a simulation. Privileging hypotheses involving the dynamics of the outputs of human-like minds will tend to usefully constrain expectations.
Again, I’m saying something obvious here—perhaps it’s too obvious to you. The only real content is something like [thinking of the output as being a simulation including various simulacra, is likely to be less misleading than thinking of it as the response of an agent].
I do not mean to imply that the internal cognition of the model necessarily has anything simulation-like about it. I do not mean that individual outputs are produced by simulation. I think you’re correct that this is highly unlikely to be the most efficient internal mechanism to predict text.
Overall, I think the word “simulation” invites confusion, since it’s forever unclear whether we’re pointing at the output of a simulation process, or the internal structure of that process. Generally I’m saying:
[adding a single token] : a single simulation step—using the training distribution’s ‘physics’
[long string of tokens] : a simulation
[process of generating a single token] : highly unlikely to be a simulation
I’m using ‘simulation’ as it’s used in the post [the imitation of the operation of a real-world process or system over time]. The real-world process is the production of the string of tokens.
I still think that referring to what the LLM does in one step as “a simulation” is at best misleading. “a prediction” seems accurate and not to mislead in the same way.
Ah, so again, you’re making the distinction that the process of generating a single token is just a single timestep of a simulation, rather than saying it’s highly unlikely to be an emulation (or even a single timestep of an emulation). With which I agree, though I don’t see it as a distinction unobvious enough that I’d expect many people to trip over it. (Perhaps my background is showing.)
OK, then we were talking rather at cross-purposes: thanks for explaining!
I think the main thing I’d point to is this section (where I’ve changed bullet points to numbers for easier reference):
I can’t convey all that experiential data here, so here are some rationalizations of why I’m partial to the term, inspired by the context of this post:
1. The word “simulator” evokes a model of real processes which can be used to run virtual processes in virtual reality.
2. It suggests an ontological distinction between the simulator and things that are simulated, and avoids the fallacy of attributing contingent properties of the latter to the former.
3. It’s not confusing that multiple simulacra can be instantiated at once, or an agent embedded in a tragedy, etc.
4. It does not imply that the AI’s behavior is well-described (globally or locally) as expected utility maximization. An arbitrarily powerful/accurate simulation can depict arbitrarily hapless sims.
5. It does not imply that the AI is only capable of emulating things with direct precedent in the training data. A physics simulation, for instance, can simulate any phenomena that plays by its rules.
6. It emphasizes the role of the model as a transition rule that evolves processes over time. The power of factored cognition / chain-of-thought reasoning is obvious.
7. It emphasizes the role of the state in specifying and constructing the agent/process. The importance of prompt programming for capabilities is obvious if you think of the prompt as specifying a configuration that will be propagated forward in time.
8. It emphasizes the interactive nature of the model’s predictions – even though they’re “just text”, you can converse with simulacra, explore virtual environments, etc.
9. It’s clear that in order to actually do anything (intelligent, useful, dangerous, etc), the model must act through simulation of something.
I think (2)-(8) are basically correct, (1) isn’t really a claim, and (9) seems either false or vacuous. So I mostly feel like the core thesis as expressed in this post is broadly correct, not wrong. (I do feel like people have taken it further than is warranted, e.g. by expecting internal mechanisms to actually involve simulations, but I don’t think those claims are in this post.)
I also think it does in fact constrain expectations. Here’s a claim that I think this post points to: “To predict what a base model will do, figure out what real-world process was most likely to produce the context so far, then predict what text that real-world process would produce next, then adopt that as your prediction for what GPT would do”. Taken literally this is obviously false (e.g. you can know that GPT is not going to factor a large prime). But it’s a good first-order approximation, and I would still use that as an important input if I were to predict today how a base model is going to continue to complete text.
(Based on your other comments maybe you disagree with the last paragraph? That surprises me. I want to check that you are specifically thinking of base models and not RLHF’d or instruction tuned models.)
Personally I agree with janus that these are (and were) mostly obvious and uncontroversial things—to people who actually played with / thought about LLMs. But I’m not surprised that LWers steeped in theoretical / conceptual thinking about EU maximizers and instrumental convergence without much experience with practical systems (at least at the time this post was written) found these claims / ideas to be novel.
To predict what a base model will do, figure out what real-world process was most likely to produce the context so far, then predict what text that real-world process would produce next, then adopt that as your prediction for what GPT would do
Yeah, I would be surprised if this is a good first-order approximation of what is going on inside an LLM. Or maybe you mean this in a non-mechanistic way?
I agree that in a non-mechanistic way, the above will produce reasonable predictions, but that’s because that’s basically a description of the task the LLM is trained on.
Like, the above sounds similar to me to “in order to predict what AlphaZero will do, choose some promising moves, then play forward the game and predict after which moves AlphaZero is most likely to win, then adopt the move that most increases the probability of winning as your prediction of what AlphaZero does”. Of course, that is approximately useless advice, since basically all you’ve done is describe the training setup of AlphaZero.
As a mechanistic explanation, I would be surprised if even with amazing mechanistic interpretability you would find some part of the LLM whose internal structure corresponds in a lot of detail to the mind or brain of the kind of person it is trying to “simulate”. I expect the way you get low loss here will involve an enormous amount of non-simulating cognition (see again my above analogy about how when humans engage in roleplay, we engage in a lot of non-simulating cognition).
To maybe go into a bit more depth on what wrong predictions I’ve seen people make on the basis of this post:
I’ve seen people make strong assertions about what kind of cognition is going on inside of LLMs, ruling out things like situational awareness for base models. (It’s quite hard to know whether base models have any situational awareness, though RLHF’d models clearly have some level. I also think what situational awareness would mean for base models is a bit confusing, but not that confusing: it would just mean that as you scale up the model, its behavior becomes quite sensitive to the context in which it is run.)
I’ve seen people make strong predictions that LLM performance can’t become superhuman on various tasks, since it’s just simulating human cognition, including on tasks where LLMs now have achieved superhuman performance
To give a concrete counterexample to the algorithm you propose for predicting what an LLM does next. Current LLMs have a broader knowledge base than any human alive. This means the algorithm of “figure out what real-world process would produce text like this” can’t be accurate, since there is no real-world process with as broad a knowledge base that produces text like that, except LLMs themselves (maybe you are making claims that only apply to base models, but I both fail to see the relevance in that case, since base models are basically irrelevant these days, and am skeptical about people making claims about LLM cognition that apply only to RLHF’d models and not the base models, given that the vast majority of datapoints that shaped the LLM’s cognition come from the base model and not the RLHF portion).
I’ve seen people say that because LLMs are just “simulators” that ultimately we can just scale them up as far as we want, and all we will get are higher-fidelity simulations of the process that created the training distribution, basically eliminating any risk from scaling with current architectures.
I think all of these predictions are pretty unwarranted, and some of them have been demonstrated to be false.
They also seem to me like predictions this post makes, and not just misunderstandings of people reading this post, but I am not sure. I am very familiar with the experience of other people asserting that a post makes predictions it is not making, because they observed someone who misunderstood the post and then made some bad predictions.
Yeah, I would be surprised if this is a good first-order approximation of what is going on inside an LLM. Or maybe you mean this in a non-mechanistic way?
Yes, I definitely meant this in the non-mechanistic way. Any mechanistic claims that sound simulator-flavored based just on the evidence in this post sounds clearly overconfident and probably wrong. I didn’t reread this post carefully but I don’t remember seeing mechanistic claims in it.
I agree that in a non-mechanistic way, the above will produce reasonable predictions, but that’s because that’s basically a description of the task the LLM is trained on. [...]
I mostly agree and this is an aspect of what I mean by “this post says obvious and uncontroversial things”. I’m not particularly advocating for this post in the review; I didn’t find it especially illuminating.
To give a concrete counterexample to the algorithm you propose for predicting what an LLM does next. Current LLMs have a broader knowledge base than any human alive. This means the algorithm of “figure out what real-world process would produce text like this” can’t be accurate
This seems somewhat in conflict with the previous quote?
Re: the concrete counterexample, yes I am in fact only making claims about base models; I agree it doesn’t work for RLHF’d models. Idk how you want to weigh the fact that this post basically just talks about base models in your review, I don’t have a strong opinion there.
I think it is in fact hard to get a base model to combine pieces of knowledge that tend not to be produced by any given human (e.g. writing an epistemically sound rap on the benefits of blood donation), and that often the strategy to get base models to do things like this is to write a prompt that makes it seem like we’re in the rare setting where text is being produced by an entity with those abilities.
Hmm, yeah, this perspective makes more sense to me, and I don’t currently believe you ended up making any of the wrong inferences I’ve seen others make on the basis of the post.
I do sure see many other people make inferences of this type. See for example the tag page for Simulator Theory which says:
Broadly it views these models as simulating a learned distribution with various degrees of fidelity, which in the case of language models trained on a large corpus of text is the mechanics underlying our world.
This also directly claims that the physics the system learned are “the mechanics underlying our world”, which I think isn’t totally false (they have probably learned a good chunk of the mechanics of our world) but is inaccurate as something trying to describe most of what is going on in a base model’s cognition.
In general I believe that many (most?) people take it too far and make incorrect inferences—partly on priors about popular posts, and partly because many people including you believe this, and those people engage more with the Simulators crowd than I do.
sometimes putting a name to what you “already know” makes a whole world of difference. [...] I see these takes, and I uniformly respond with some version of the sentiment “it seems like you aren’t thinking of GPT as a simulator!”
I think in all three of the linked cases I broadly directionally agreed with nostalgebraist, and thought that the Simulator framing was at least somewhat helpful in conveying the point. The first one didn’t seem that important (it was critiquing imo a relatively minor point), but the second and third seemed pretty direct rebuttals of popular-ish views. (Note I didn’t agree with all of what was said, e.g. nostalgebraist doesn’t seem at all worried about a base GPT-1000 model, whereas I would put some probability on doom for malign-prior reasons. But this feels more like “reasonable disagreement” than “wildly misled by simulator framing”.)
If one were to distinguish between “behavioral simulators” and “procedural simulators”, the problem would vanish. Behavioral simulators imitate the outputs of some generative process; procedural simulators imitate the details of the generative process itself. When they’re working well, base models clearly do the former, even as I suspect they don’t do the latter.
I’ve been thinking about this post a lot since it first came out. Overall, I think it’s core thesis is wrong, and I’ve seen a lot of people make confident wrong inferences on the basis of it.
The core problem with the post was covered by Eliezer’s post “GPTs are Predictors, not Imitators” (which was not written, I think, as a direct response, but which still seems to me to convey the core problem with this post):
The Simulators post repeatedly alludes to the loss function on which GPTs are trained corresponding to a “simulation objective”, but I don’t really see why that would be true. It is technically true that a GPT that perfectly simulates earth, including the creation of its own training data set, can use that simulation to get perfect training loss. But actually doing so would require enormous amounts of compute and we of course know that nothing close to that is going on inside of GPT-4.
To me, the key feature of a “simulator” would be a process that predicts the output of a system by developing it forwards in time, or some other time-like dimension. The predictions get made by developing an understanding of the transition function of a system between time-steps (the “physics” of the system) and then applying that transition function over and over again until your desired target time.
I would be surprised if this is how GPT works internally in its relationship to the rest of the world and how it makes predictions. The primary interesting thing that seems to me true about GPT-4s training objective is that it is highly myopic. Beyond that, I don’t see any reason to think of it as particularly more likely to create something that tries to simulate the physics of any underlying system than other loss functions one could choose.
When GPT-4 encounters a hash followed by the pre-image of that hash, or a complicated arithmetic problem, or is asked a difficult factual geography question, it seems very unlikely that the way GPT-4 goes about answering that question is purely rooted in simulating the mind that generated the hash and pre-image, or the question it is being asked. There will probably be some simulation going on, but a lot of what’s going on is just straightforward problem-solving of the problems that seem necessary to predict the next tokens successfully, many of which will not correspond to simulating the details of the process that generated those tokens (in the case of a hash followed by a pre-image, the humans that generated that tuple of course had access to the pre-image first, and then hashed it, and then just reversed the order in which they pasted it into the text, making this talk practically impossible to solve if you structure your internal cognition as a simulation of any kind of system).
This post is long, and I might have misunderstood it, and many people I talked to keep referencing this post as something that successfully gets some kind of important intuition across, but when I look at the concrete statements and predictions made by this post, I don’t see how it holds up to scrutiny, though it is still plausible to me that there is some bigger image being painted that does help people understand some important things better.
I think you missed the point. I agree that language models are predictors rather than imitators, and that they probably don’t work by time-stepping forward a simulation. Maybe Janus should have chosen a word other than “simulators.” But if you gensym out the particular choice of word, this post is encapsulating the most surprising development of the past few years in AI (and therefore, the world).
Chapter 10 of Bostrom’s Superintelligence (2014) is titled, “Oracles, Genies, Sovereigns, Tools”. As the “Inadequate Ontologies” section of this post points out, language models (as they are used and heralded as proto-AGI) aren’t any of those things. (The Claude or ChatGPT “assistant” character is, well, a simulacrum, not “the AI itself”; it’s useful to have the word simulacrum for this.)
This is a big deal! Someone whose story about why we’re all going to die was limited to, “We were right about everything in 2014, but then there was a lot of capabilities progress,” would be willfully ignoring this shocking empirical development (which doesn’t mean we’re not all going to die, but it could be for somewhat different reasons).
Call it a “prediction objective”, then. The thing that makes the prediction objective special is that it lets us copy intelligence from data, which would have sounded nuts in 2014 and probably still does (but shouldn’t).
If you think of gradient descent as an attempted “utility function transfer” (from loss function to trained agent) that doesn’t really work because of inner misalignment, then it may not be clear why it would induce simulator-like properties in the sense described in the post.
But why would you think of SGD that way? That’s not what the textbook says. Gradient descent is function approximation, curve fitting. We have a lot of data (x, y), and a function f(x, ϕ), and we keep adjusting ϕ to decrease −log P(y|f(x, ϕ)): that is, to make y = f(x, ϕ) less wrong. It turns out that fitting a curve to the entire internet is surprisingly useful, because the internet encodes a lot of knowledge about the world and about reasoning.
If you don’t see why “other loss functions one could choose” aren’t as useful for mirroring the knowledge encoded in the internet, it would probably help to be more specific? What other loss functions? How specifically do you want to adjust ϕ, if not to decrease −log P(y|f(x, ϕ))?
Sure, I am fine with calling it a “prediction objective” but if we drop the simulation abstraction then I think most of the sentences in this post don’t make sense. Here are some sentences which only make sense if you are talking about a simulation in the sense of stepping forward through time, and not just something optimized according to a generic “prediction objective”.
I think these quotes illustrate that the concept of a simulator as invoked in this post is about simulating the process that gave rise to your training distribution, according to some definition of time. But I don’t think this is how GPT works and I don’t think helps you make good predictions about what happens. Many of the problems GPT successfully solves are not solvable via this kind of simulation, as far as I can tell.
I don’t think the behavior we see in large language model is well-explained by the loss function being a “prediction objective”. Imagine a prediction objective that is not myopic, but requires creating long chains of internal inference to arrive at, more similar to the length of a full-context completion of GPT. I don’t see how such a prediction objective would give rise to the interesting dynamics that seem true about GPT. My guess is in the pursuit of such a non-myopic prediction objective you would see the development of quite instrumental forms of reasoning and general purpose problem-solving, with substantial divergence from how we currently think of GPTs.
The fact that the training signal is so myopic on the other hand, and applies on a character-by-character level, that seems to explain a huge amount of the variance.
To be clear, I think there is totally interesting content to study in how language models work given the extremely myopic prediction objective that they optimize, that nevertheless gives rise to interesting high-level behavior, and I agree with you that studying that is among the most important things to do at the present time, but I think this post doesn’t offer a satisfying answer to the questions raised by such studies, and indeed seems to make a bunch of wrong predictions.
The pretraining objective isn’t myopic? The parameter updates route across the entire context, backing up from the attention scores of later positions through e.g. the MLP sublayer outputs at position 0.
As a smaller note, language models do not optimize the predictive objective, so much as the loss function optimizes the language model. I think the wording you chose is going to cause confusion and lead to incorrect beliefs.
This is something I’ve been thinking a lot about, but still don’t feel super robust in. I currently think it makes sense to describe the pretraining objective as myopic in the relevant way, but am really not confident. I agree that the training objective isn’t as myopic as I implied here, though I also don’t think the training objective is well-summarized as jointly optimizing the whole context-length response.
I have a dialogue I’ll probably publish soon about this, and would be interested in your comments on it when it goes live. Probably doesn’t make sense to go in-depth about this before that’s published, since it captures my current confusions and thoughts probably better than what I would write anew in a comment thread like this.
I think a lot of what is causing confusion here is the word ‘simulation’. People often talk colloquially about “running a weather simulation” or “simulating an aircraft’s wing under stress”. This is a common misnomer; technically, the correct word they should be using there is ‘emulation’. If you are running a detailed analysis of each subprocess that matters and combining all of their interactions together to produce a detailed prediction, then you are ‘emulating’ something. On the other hand, if you’re doing something that more resembles a machine learning model pragmatically learning its behavior (what one could even call a stochastic parrot), trained to predict the same outcomes over some large set of sample situations, then you’re running a ‘simulation’.
As janus writes:
So he is clearly and explicitly making this distinction between the words ‘simulation’ and ‘emulation’, and evidently understands the correct usage of each of them. To pick a specific example, the weather models that most governments’ meteorological departments run are emulations: they divide the entire atmosphere (or the part near that country) into a great many small cells and emulate the entire system (except at the level of the smallest cells, where they fall back on simulation, since they cannot afford to further subdivide the problem as the physics of turbulence would otherwise require); whereas the (vastly more computationally efficient) GraphCast system that DeepMind recently built is a simulation. It basically relies on the weather continuing to act in the future in ways it has in the past (so potentially could be thrown off by effects like global warming). So Simulator Theory is saying “LLMs work like GraphCast makes weather predictions”, not “LLMs work like detailed models of the atmosphere split into a vast number of tiny cells make weather predictions”.
[The fact that this is even possible in non-linear systems is somewhat surprising, as janus is expressing in the quote above, but then Science has often managed to find useful regularities in the behavior of very large systems, ones that do not require mechanistically breaking their behavior down all the way to individual fundamental particles to model them. Most behavior most of the time is not in fact NP-complete, and has Lyapunov times much longer than the periods between interactions of its constituent fundamental particles — so clearly often a lot of the fine details wash out. Apparently this is also true of the human brain, unlike the case for computers.]
So the “Simulator Theory” is not an “Emulator Theory”. Janus is explicitly not claiming that an LLM “perfectly [emulates] earth, including the creation of its own training data set”. Any fan of Simulator Theory who makes claims like that has not correctly understood it (most likely due to this common confusion over the meaning of the word ‘simulate’). The claim in the Simulation Thesis is that the ML model finds and learns regularities in its training set, and then reapplies them in a way that makes (quite good) predictions, without doing a detailed emulation of the process it is predicting, in just the same way that GraphCast makes weather predictions without (and far more computationally cheaply than) emulating the entire atmosphere. (Note that this claim is entirely uncontroversial: that’s exactly what machine learning models always do when they work.) So the LLM has internal world models, but they are models of the behavior of parts of the world, not of the detailed underlying physical process that produces that behavior. Also note that while such models can sometimes correctly extrapolate outside the training distribution, this requires luck: specifically, that no new phenomena become important to the behavior outside the training distribution that weren’t learnable from the behavior inside it. The risk of this being false increases the more complex the underlying system is and the further you attempt to extrapolate outside the training distribution.
I would actually be curious about having a dialogue with anyone who disagrees with the review above. It seems like this post had a large effect on people, and I would like there to be a proper review of it, so having two people have a debate about its merits seems like a decent format to me.
Maybe @janus, @Zack_M_Davis, @Charlie Steiner, @Joe_Collman?
I can at least give you the short version of why I think you’re wrong, if you want to chat lmk I guess.
Plain text: “GPT is a simulator.”
Correct interpretation: “Sampling from GPT to generate text is a simulation, where the state of the simulation’s ‘world’ is the text and GPT encodes learned transition dynamics between states of the text.”
Mistaken interpretation: “GPT works by doing a simulation of the process that generated the training data. To make predictions, it internally represents the physical state of the Earth, and predicts the next token by applying learned transition dynamics to the represented state of the Earth to get a future state of the Earth.”
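To make the “correct interpretation” above concrete, here is a minimal sketch; the transition rule below is a made-up toy standing in for a trained model:

```python
import random

# A made-up transition rule standing in for a trained model: it maps the
# current text-state to a distribution over possible next tokens.
def next_token_distribution(state):
    last = state[-1] if state else "<s>"
    return {last: 0.5, "the": 0.3, "cat": 0.2}  # trivially toy dynamics

def step(state):
    # One "timestep": the simulation's world-state is the text itself,
    # and the learned distribution is the transition rule between states.
    dist = next_token_distribution(state)
    tokens, probs = zip(*dist.items())
    return state + [random.choices(tokens, weights=probs)[0]]

def simulate(prompt_tokens, n_steps):
    # Roll the dynamics forward. Nothing here claims anything about what
    # happens *inside* the transition function, only about the trajectory.
    state = list(prompt_tokens)
    for _ in range(n_steps):
        state = step(state)
    return state

print(simulate(["once", "upon", "a", "time"], 5))
```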
-
So that’s the “core thesis.” Maybe it would help to do the same thing for some of the things you might use the simulator framing for?
Plain text: “GPT can simulate a lot of different humans.”
Correct interpretation: “The text dynamics of GPT can support long-lived dynamical processes that write text like a lot of different humans. This is a lot like how a simulation of the solar system could have a lot of different orbits depending on the initial condition, except the laws of text are a lot more complicated and anthropocentric than the laws of celestial mechanics.”
Mistaken interpretation: “When GPT is talking like a person, that’s because there is a sentient simulation of a person in there doing thinking that is then translated into words.”
Plain text: “Asking whether GPT knows some fact is the wrong question. It’s specific simulacra that know things.”
Correct interpretation: “The dynamical processes that get you human-like text out of GPT (‘simulacra’) can vary in how easy it is to get them to recite a desired fact. You might hope there’s some ‘neutral’ way to get a recitation of a fact out of GPT, but there is no such neutral way, it’s all dynamical processes. When it comes to knowing things, GPT is more like a compression algorithm than a person. It knows a fact well when that fact is the result of simple initial conditions.”
Drawback of the correct interpretation: Focuses imagination on text processes that play human roles, potentially obscuring more general ways to get to desired output text.
Mistaken interpretation: “Inside GPT’s simulation of the physical state of the Earth, it tracks what different people know.”
Plain text: “If you try to get GPT to solve hard problems, and it succeeds, it might be simulating a non-human intelligence.”
Correct interpretation: “GPT has learned text dynamics that include a lot of clever rules for getting correct answers, because it’s had to predict a lot of text that requires cleverness. A lot of those clever rules were learned to predict human text, and are interwoven with other heuristics that keep its state in the distribution of human text. But if it’s being clever in ways that humans aren’t, it’s probably going to leave the distribution of human text in other ways.”
Mistaken? interpretation: “If GPT starts getting good at reversing hashes, it’s about to break out of its server and start turning the Earth into well-predicted tokens.”
Sure, I wasn’t under the impression that the claim was that GPT was literally simulating earth, but I don’t understand how describing something as a simulation of this type, over a completely abstract “next token space”, constrains expectations.
Like, I feel like you can practically define all even slightly recurrent systems as “simulators” of this type. If we aren’t talking about simulating something close to human minds, what predictions can we make?
Like, let’s say I have a very classical RL algorithm, something like AlphaZero with MCTS. It also “simulates” a game, state by state, into the future (into many different branches). But how does this help me predict what the system does? AlphaZero seems to share few of the relevant dynamics this post is talking about.
This is what all that talk about predictive loss was for. Training on predictive loss gets you systems that are especially well-suited to being described as learning the time-evolution dynamics of the training distribution. Not in the sense that they’re simulating the physical reality underlying the training distribution, merely in the sense that they’re learning dynamics for the behavior of the training data.
Sure, you could talk about AlphaZero in terms of prediction. But it’s not going to have the sort of configurability that makes the simulator framing so fruitful in the case of GPT (or in the case of computer simulations of the physical world). You can’t feed AlphaZero the first 20 moves of a game by Magnus Carlsen and have it continue like him.
Or to use a different example, one time talking about simulators is when someone asks “Does GPT know this fact?” because GPT’s dynamics are inhomogeneous—it doesn’t always act with the same quality of knowing the fact or not knowing it. But AlphaZero’s training process is actively trying to get rid of that kind of inhomogeneity—AlphaZero isn’t trained to mimic a training distribution, it’s trained to play high-scoring moves.
The simulator framing has no accuracy advantage over thinking directly in terms of next token prediction, except that thinking in terms of simulator and simulacra sometimes usefully compresses the relevant ideas, and so lets people think larger new thoughts at once. Probably useful for coming up with ChatGPT jailbreaks. Definitely useful for coming up with prompts for base GPT.
To add to Charlie’s point (which seems right to me):
As I understand things, I think we are talking about a simulation of something somewhat close to human minds—e.g. text behaviour of humanlike simulacra (made of tokens—but humans are made of atoms). There’s just no claim of an internal simulation.
I’d guess a common upside is to avoid constraining expectations unhelpfully in ways that [GPT as agent] might.
However, I do still worry about saying “GPT is a simulator” rather than something like “GPT currently produces simulations”.
I think the former suggests too strongly that we understand something about what it’s doing internally—e.g. at least that it’s not inner misaligned, and won’t stop acting like a simulator at some future time (and can easily be taken to mean that it’s doing simulation internally).
If the aim is to get people thinking more clearly, I’d want it to be clearer that this is a characterization of [what GPTs currently output], not [what GPTs fundamentally are].
I mean, that is the exact thing that I was arguing against in my review.
I think the distribution of human text just has too many features that are hard to produce via simulating human-like minds. I agree that the system is trained on imitating human text, and that necessarily requires being able to roleplay as many different humans, but I don’t think the process of that roleplay is particularly likely to be akin to a simulation (similarly to how, when humans roleplay as other humans, they do a lot of cognition that isn’t simulation: when an actor plays a character in a movie, they do things like explicitly think about the historical period in which the character is set, they recognize that certain scenes will be hard to pull off, they solve a problem using the knowledge they have when not roleplaying and then retrofit their solution into something the character might have come up with, etc. When humans imitate things we are not limited to simulating the target of our imitation.)
The cognitive landscape of an LLM is also very different from that of humans, and it seems clear that in many contexts the behavior of an LLM will generalize quite differently than it would for a human; simulation again seems unlikely to be the only, or honestly even the primary, way I expect an LLM to get good at human text imitation given that differing cognitive landscape.
Oh, hang on—are you thinking that Janus is claiming that GPT works by learning some approximation to physics, rather than ‘physics’?
IIUC, the physics being referred to is either through analogy (when it refers to real-world physics), or as a generalized ‘physics’ of [stepwise addition of tokens]. There’s no presumption of a simulation of physics (at any granularity).
E.g.:
Apologies if I’m the one who’s confused :).
This just seemed like a natural explanation for your seeming to think the post is claiming a lot more mechanistically. (I think it’s claiming almost nothing)
No, I didn’t mean to imply that. I understand that “physics” here is a general term for understanding how any system develops forward according to some abstract definition of time.
What I am saying is that even with a more expansive definition of physics, it seems unlikely to me that GPT internally simulates a human mind (or anything else really) in a way where structurally there is a strong similarity between the way a human brain steps forward in physical time, and the way the insides of the transformer generates additional tokens.
Sure, but I don’t think anyone is claiming that there’s a similarity between a brain stepping forward in physical time and transformer internals. (perhaps my wording was clumsy earlier)
IIUC, the single timestep in the ‘physics’ of the post is the generation and addition of one new token.
I.e. GPT uses [some internal process] to generate a token.
Adding the new token is a single atomic update to the “world state” of the simulation.
The [some internal process] defines GPT’s “laws of physics”.
The post isn’t claiming that GPT is doing some generalized physics internally.
It’s saying that [GPT(input_states) --> (output_states)] can be seen as defining the physical laws by which a simulation evolves.
As I understand it, it’s making almost no claim about internal mechanism.
Though I think “GPT is a simulator” is only intended to apply if its simulator-like behaviour robustly generalizes—i.e. if it’s always producing output according to the “laws of physics” of the training distribution (this is imprecise, at least in my head—I’m unclear whether Janus have any more precise criterion).
I don’t think the post is making substantive claims that disagree with [your model as I understand it]. It’s only saying: here’s a useful way to think about the behaviour of GPT.
An LLM is a simulation, a system statistically trained to try to predict the same distribution of outputs as a human writing process (which could be a single brain in near-real-time, or an entire Wikipedia community of them interacting over years). It is not a detailed physical emulation of either of these processes.
The simple fact that a human brain has O(10^14) synapses and current LLMs only have up to O(10^12) parameters makes it clear that it’s going to be a fairly rough simulation — I actually find it pretty astonishing that we often get as good a simulation as we do out of a system that clearly has orders of magnitude less computational complexity. Apparently a lot of aspects of human text generation aren’t so complex as to actually engage and require a large fraction of the entire computational capacity of the brain to get even a passable approximation to the output. Indeed, the LLM scaling laws give us a strong sense of how much, at an individual token-guessing level, the predictability of human text improves as you throw more computational capacity and a larger training sample set at the problem, and the answer is logarithmic: doubling the product of computational capacity and dataset size produces a fixed amount of improvement in the perplexity measure.
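To illustrate the shape of that claim, here is a toy sketch loosely in the style of a Chinchilla-type power-law fit; the constants are made up for illustration and are not actual fitted values:

```python
# Toy scaling-law shape: loss falls as a power law in parameter count N and
# dataset size D, so each doubling buys a roughly fixed improvement in
# log-space. All constants are made up for illustration, not fitted values.
def toy_loss(n_params, n_tokens, e=1.7, a=400.0, b=400.0, alpha=0.34, beta=0.28):
    return e + a / n_params**alpha + b / n_tokens**beta

for scale in (1, 2, 4, 8):
    n, d = scale * 1e9, scale * 2e10
    print(f"{scale:>2}x scale: toy loss = {toy_loss(n, d):.3f}")
```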
I don’t disagree, but I don’t think that describing the process an LLM uses to generate a single token as a simulation is clarifying in this context.
I’m fairly sure the post is making no such claim, and I think it becomes a lot more likely that readers will have habryka’s interpretation if the word “simulation” is applied to LLM internals (and correctly conclude that this interpretation entails implausible claims).
I think “predictor” or the like is much better here.
Unless I’m badly misunderstanding, the post is taking a time-evolution-of-a-system view of the string of tokens—not of LLM internals.
I don’t think it’s claiming anything about what the internal LLM mechanism looks like.
I think janus is explicitly using the verb ‘simulate’ as opposed to ‘emulate’ because he is not making any claims about LLM internals (and indeed doesn’t think the internals, whatever they may be, include a detailed emulation), and I think that this careful distinction in terminology (which janus explicitly employs at one point in the post above, when discussing just this question, so is clearly familiar with) is sadly lost on many readers, who tend to assume that the two words mean the same thing, since the word ‘simulate’ is commonly misused to include ‘emulate’ — a mistake I’ve often made myself.
I agree that the word ‘predict’ would be less liable to this particular misunderstanding, but I think it has some other downsides: you’d have to ask janus why he didn’t pick it.
So my claim is, if someone doesn’t understand why it’s called “Simulator Theory” as opposed to “Emulator Theory”, then they haven’t correctly understood janus’ post. (And I have certainly seen examples of people who appear to think LLMs actually are emulators, of nearly unlimited power. For example, the ones who suggested just asking an LLM for the text of the most cited paper on AI Alignment from 2030, something that predicting it correctly would require emulating a significant proportion of the world for about six years.)
The point I’m making here is that in the terms of this post the LLM defines the transition function of a simulation.
I.e. the LLM acts on [string of tokens], to produce [extended string of tokens].
The simulation is the entire thing: the string of tokens changing over time according to the action of the LLM.
Saying “the LLM is a simulation” strongly suggests that a simulation process (i.e. “the imitation of the operation of a real-world process or system over time”) is occurring within the LLM internals.
Saying “GPT is a simulator” isn’t too bad—it’s like saying “The laws of physics are a simulator”. Loosely correct.
Saying “GPT is a simulation” is like saying “The laws of physics are a simulation”, which is at least misleading—I’d say wrong.
In another context it might not be too bad. In this post simulation has been specifically described as “the imitation of the operation of a real-world process or system over time”. There’s no basis to think that the LLM is doing this internally.
Unless we’re claiming that it’s doing something like that internally, we can reasonably say “The LLM produces a simulation”, but not “The LLM is a simulation”.
(oh and FYI, Janus is “they”—in the sense of actually being two people: Kyle and Laria)
I guess (as an ex-physicist and long-time software engineer) I’m not really hung up about the fact that emulations are normally performed one timestep at a time, and simulations certainly can be, so didn’t see much need to make a linguistic distinction for it. But that’s fine, I don’t disagree. Yes, an emulation or (in applicable cases) simulation process will consist of a sequence of many timesteps, and an LLM predicting text similarly does so one token at a time sequentially (which may not, in fact, be the order that humans produced them, or consume them, though by default usually is — something that LLMs often have trouble with, presumably due to their fixed forward-pass computational capacity).
Suddenly their username makes sense! Thanks, duly noted.
Perhaps we’re talking past each other to a degree. I don’t disagree with what you’re saying.
I think I’ve been unclear—or perhaps just saying almost vacuous things. I’m attempting to make a very weak claim (I think the post is also making no strong claim—not about internal mechanism, at least).
I only mean that the output can often be efficiently understood in terms of human characters (among other things). I.e. that the output is a simulation, and that human-like minds will be an efficient abstraction for us to use when thinking about such a simulation. Privileging hypotheses involving the dynamics of the outputs of human-like minds will tend to usefully constrain expectations.
Again, I’m saying something obvious here—perhaps it’s too obvious to you. The only real content is something like [thinking of the output as being a simulation including various simulacra, is likely to be less misleading than thinking of it as the response of an agent].
I do not mean to imply that the internal cognition of the model necessarily has anything simulation-like about it. I do not mean that individual outputs are produced by simulation. I think you’re correct that this is highly unlikely to be the most efficient internal mechanism to predict text.
Overall, I think the word “simulation” invites confusion, since it’s forever unclear whether we’re pointing at the output of a simulation process, or the internal structure of that process.
Generally I’m saying:
[add a single token] : single simulation step—using the training distribution’s ‘physics’.
[long string of tokens] : a simulation
[process of generating a single token] : [highly unlikely to be a simulation]
Did you in fact mean ‘emulation’ for the last of those three items?
I’m using ‘simulation’ as it’s used in the post [the imitation of the operation of a real-world process or system over time]. The real-world process is the production of the string of tokens.
I still think that referring to what the LLM does in one step as “a simulation” is at best misleading. “a prediction” seems accurate and not to mislead in the same way.
Ah, so again, you’re making the distinction that the process of generating a single token is just a single timestep of a simulation, rather than saying it’s highly unlikely to be an emulation (or even a single timestep of an emulation). With which I agree, though I don’t see it as a distinction so non-obvious that I’d expect many people to trip over it. (Perhaps my background is showing.)
OK, then we were talking rather at cross-purposes: thanks for explaining!
I think the main thing I’d point to is this section (where I’ve changed bullet points to numbers for easier reference):
I think (2)-(8) are basically correct, (1) isn’t really a claim, and (9) seems either false or vacuous. So I mostly feel like the core thesis as expressed in this post is broadly correct, not wrong. (I do feel like people have taken it further than is warranted, e.g. by expecting internal mechanisms to actually involve simulations, but I don’t think those claims are in this post.)
I also think it does in fact constrain expectations. Here’s a claim that I think this post points to: “To predict what a base model will do, figure out what real-world process was most likely to produce the context so far, then predict what text that real-world process would produce next, then adopt that as your prediction for what GPT would do”. Taken literally this is obviously false (e.g. you can know that GPT is not going to factor a large prime). But it’s a good first-order approximation, and I would still use that as an important input if I were to predict today how a base model is going to continue to complete text.
(Based on your other comments maybe you disagree with the last paragraph? That surprises me. I want to check that you are specifically thinking of base models and not RLHF’d or instruction tuned models.)
Personally I agree with janus that these are (and were) mostly obvious and uncontroversial things—to people who actually played with / thought about LLMs. But I’m not surprised that LWers steeped in theoretical / conceptual thinking about EU maximizers and instrumental convergence without much experience with practical systems (at least at the time this post was written) found these claims / ideas to be novel.
Yeah, I would be surprised if this is a good first-order approximation of what is going on inside an LLM. Or maybe you mean this in a non-mechanistic way?
I agree that in a non-mechanistic way, the above will produce reasonable predictions, but that’s because that’s basically a description of the task the LLM is trained on.
Like, the above sounds similar to me to “in order to predict what AlphaZero will do, choose some promising moves, then play forward the game and predict after which moves AlphaZero is most likely to win, then adopt the move that most increases the probability of winning as your prediction of what AlphaZero does”. Of course, that is approximately useless advice, since basically all you’ve done is describe the training setup of AlphaZero.
As a mechanistic explanation, I would be surprised if even with amazing mechanistic interpretability you will find some part of the LLM whose internal structure corresponds in a lot of detail to the mind or brain of the kind of person it is trying to “simulate”. I expect the way you get low loss here will involve an enormous amount of non-simulating cognition (see again my above analogy about how, when humans engage in roleplay, we engage in a lot of non-simulating cognition).
To maybe go into a bit more depth on what wrong predictions I’ve seen people make on the basis of this post:
I’ve seen people make strong assertions about what kind of cognition is going on inside of LLMs, ruling out things like situational awareness for base models (it’s quite hard to know whether base models have any situational awareness, though RLHF’d models clearly have some level; I also think what situational awareness would mean for base models is a bit confusing, but not that confusing: it would just mean that as you scale up the model, its behavior becomes quite sensitive to the context in which it is run)
I’ve seen people make strong predictions that LLM performance can’t become superhuman on various tasks, since it’s just simulating human cognition, including on tasks where LLMs now have achieved superhuman performance
To give a concrete counterexample to the algorithm you propose for predicting what an LLM does next: current LLMs have a broader knowledge base than any human alive. This means the algorithm of “figure out what real-world process would produce text like this” can’t be accurate, since there is no real-world process with as broad a knowledge base that produces text like that, except LLMs themselves (maybe you are making claims that only apply to base models, but I both fail to see the relevance in that case, since base models are basically irrelevant these days, and am skeptical about people making claims about LLM cognition that apply only to RLHF’d models and not the base models, given that the vast majority of datapoints that shaped the LLM’s cognition come from the base model and not the RLHF portion)
I’ve seen people say that because LLMs are just “simulators” that ultimately we can just scale them up as far as we want, and all we will get are higher-fidelity simulations of the process that created the training distribution, basically eliminating any risk from scaling with current architectures.
I think all of these predictions are pretty unwarranted, and some of them have been demonstrated to be false.
They also seem to me like predictions this post makes, and not just misunderstandings of people reading this post, but I am not sure. I am very familiar with the experience of other people asserting that a post makes predictions it is not making, because they observed someone who misunderstood the post and then made some bad predictions.
Yes, I definitely meant this in the non-mechanistic way. Any mechanistic claims that sound simulator-flavored based just on the evidence in this post sounds clearly overconfident and probably wrong. I didn’t reread this post carefully but I don’t remember seeing mechanistic claims in it.
I mostly agree and this is an aspect of what I mean by “this post says obvious and uncontroversial things”. I’m not particularly advocating for this post in the review; I didn’t find it especially illuminating.
This seems somewhat in conflict with the previous quote?
Re: the concrete counterexample, yes I am in fact only making claims about base models; I agree it doesn’t work for RLHF’d models. Idk how you want to weigh the fact that this post basically just talks about base models in your review, I don’t have a strong opinion there.
I think it is in fact hard to get a base model to combine pieces of knowledge that tend not to be produced by any given human (e.g. writing an epistemically sound rap on the benefits of blood donation), and that often the strategy to get base models to do things like this is to write a prompt that makes it seem like we’re in the rare setting where text is being produced by an entity with those abilities.
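As a hypothetical illustration of that prompting strategy (the framing text and the `base_model.generate` call below are made up, not from the original discussion):

```python
# Hypothetical illustration of the prompting strategy described above:
# frame the document so that the implied author plausibly has the unusual
# combination of abilities, rather than instructing the model directly.
prompt = (
    "The following rap about the benefits of blood donation was written by a "
    "rapper who is also a practicing hematologist, and was fact-checked line "
    "by line before publication in a public-health outreach campaign.\n\n"
    "Verse 1:\n"
)
# completion = base_model.generate(prompt)  # base_model is a made-up handle
```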
Hmm, yeah, this perspective makes more sense to me, and I don’t currently believe you ended up making any of the wrong inferences I’ve seen others make on the basis of the post.
I do sure see many other people make inferences of this type. See for example the tag page for Simulator Theory which says:
This also directly claims that the physics the system learned are “the mechanics underlying our world”, which I think isn’t totally false (they have probably learned a good chunk of the mechanics of our world) but is inaccurate as something trying to describe most of what is going on in a base model’s cognition.
Yeah, agreed that’s a clear overclaim.
In general I believe that many (most?) people take it too far and make incorrect inferences—partly on priors about popular posts, and partly because many people including you believe this, and those people engage more with the Simulators crowd than I do.
Fwiw I was sympathetic to nostalgebraist’s positive review saying:
I think in all three of the linked cases I broadly directionally agreed with nostalgebraist, and thought that the Simulator framing was at least somewhat helpful in conveying the point. The first one didn’t seem that important (it was critiquing imo a relatively minor point), but the second and third seemed pretty direct rebuttals of popular-ish views. (Note I didn’t agree with all of what was said, e.g. nostalgebraist doesn’t seem at all worried about a base GPT-1000 model, whereas I would put some probability on doom for malign-prior reasons. But this feels more like “reasonable disagreement” than “wildly misled by simulator framing”.)
Yeah—I just noticed this ”...is the mechanics underlying our world.” on the tag page.
Agreed that it’s inaccurate and misleading.
I hadn’t realized it was being read this way.
If one were to distinguish between “behavioral simulators” and “procedural simulators”, the problem would vanish. Behavioral simulators imitate the outputs of some generative process; procedural simulators imitate the details of the generative process itself. When they’re working well, base models clearly do the former, even as I suspect they don’t do the latter.