An LLM is a simulation, a system statistically trained to try to predict the same distribution of outputs as a human writing process (which could be a single brain in near-real-time, or an entire Wikipedia community of them interacting over years). It is not a detailed physical emulation of either of these processes.
The simple fact that a human brain has O(10^14) synapses and current LLMs only have up to O(10^12) parameters makes it clear that it’s going to be a fairly rough simulation. I actually find it pretty astonishing that we often get as good a simulation as we do out of a system that clearly has orders of magnitude less computational complexity. Apparently a lot of aspects of human text generation aren’t so complex as to actually engage and require a large fraction of the entire computational capacity of the brain to get even a passable approximation to the output. Indeed, the LLM scaling laws give us a strong sense of how much, at an individual token-guessing level, the predictability of human text improves as you throw more computational capacity and a larger training sample set at the problem, and the answer is logarithmic: doubling the product of computational capacity and dataset size produces a fixed amount of improvement in the perplexity measure.
I don’t disagree, but I don’t think that describing the process an LLM uses to generate a single token as a simulation is clarifying in this context.
I’m fairly sure the post is making no such claim, and I think it becomes a lot more likely that readers will have habryka’s interpretation if the word “simulation” is applied to LLM internals (and correctly conclude that this interpretation entails implausible claims). I think “predictor” or the like is much better here.
Unless I’m badly misunderstanding, the post is taking a time-evolution-of-a-system view of the string of tokens—not of LLM internals. I don’t think it’s claiming anything about what the internal LLM mechanism looks like.
I think janus is explicitly using the verb ‘simulate’ as opposed to ‘emulate’ because he is not making any claims about LLM internals (and indeed doesn’t think the internals, whatever they may be, include a detailed emulation). I think that this careful distinction in terminology (which janus explicitly employs at one point in the post above, when discussing just this question, so is clearly familiar with) is sadly lost on many readers, who tend to assume that the two words mean the same thing, since the word ‘simulate’ is commonly misused to include ‘emulate’, a mistake I’ve often made myself.
I agree that the word ‘predict’ would be less liable to this particular misunderstanding, but I think it has some other downsides: you’d have to ask janus why he didn’t pick it.
So my claim is, if someone doesn’t understand why it’s called “Simulator Theory” as opposed to “Emulator Theory”, then they haven’t correctly understood janus’ post. (And I have certainly seen examples of people who appear to think LLMs actually are emulators, of nearly unlimited power. For example, the ones who suggested just asking an LLM for the text of the most cited paper on AI Alignment from 2030, which, to predict correctly, would require emulating a significant proportion of the world for about six years.)
The point I’m making here is that in the terms of this post the LLM defines the transition function of a simulation.
I.e. the LLM acts on [string of tokens], to produce [extended string of tokens]. The simulation is the entire thing: the string of tokens changing over time according to the action of the LLM.
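A minimal sketch of that distinction, assuming a toy bigram table in place of a real LLM. The names `TOY_BIGRAMS`, `toy_step` and `rollout` are hypothetical and not from the post; the only point is that the step function is the transition rule, while the simulation is the whole trajectory of token strings it generates.

```python
import random
from typing import Callable, List

Tokens = List[str]

# Hypothetical stand-in for an LLM: a tiny bigram table, not a real model.
TOY_BIGRAMS = {
    "the": ["cat", "dog"],
    "cat": ["sat"],
    "dog": ["sat"],
    "sat": ["down"],
}


def toy_step(tokens: Tokens, rng: random.Random) -> Tokens:
    """Transition function: [string of tokens] -> [extended string of tokens]."""
    # Look only at the last token; fall back to an end marker if it is unknown.
    candidates = TOY_BIGRAMS.get(tokens[-1], ["<eos>"])
    return tokens + [rng.choice(candidates)]


def rollout(
    step: Callable[[Tokens, random.Random], Tokens],
    prompt: Tokens,
    n_steps: int,
    seed: int = 0,
) -> List[Tokens]:
    """The simulation: every state reached by repeatedly applying `step`."""
    rng = random.Random(seed)
    states = [list(prompt)]  # state 0 is the (non-empty) prompt itself
    for _ in range(n_steps):
        states.append(step(states[-1], rng))
    return states


trajectory = rollout(toy_step, ["the"], n_steps=3)
for t, state in enumerate(trajectory):
    print(t, " ".join(state))
```

In these terms `toy_step` plays the role of the LLM, and `trajectory` is the thing the post would call the simulation. Nothing in `rollout` depends on what happens inside `step`, which is why this framing makes no claim about LLM internals.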
Saying “the LLM is a simulation” strongly suggests that a simulation process (i.e. “the imitation of the operation of a real-world process or system over time”) is occurring within the LLM internals.
Saying “GPT is a simulator” isn’t too bad—it’s like saying “The laws of physics are a simulator”. Loosely correct. Saying “GPT is a simulation” is like saying “The laws of physics are a simulation”, which is at least misleading—I’d say wrong.
In another context it might not be too bad. In this post simulation has been specifically described as “the imitation of the operation of a real-world process or system over time”. There’s no basis to think that the LLM is doing this internally.
Unless we’re claiming that it’s doing something like that internally, we can reasonably say “The LLM produces a simulation”, but not “The LLM is a simulation”.
(oh and FYI, Janus is “they”—in the sense of actually being two people: Kyle and Laria)
The point I’m making here is that in the terms of this post the LLM defines the transition function of a simulation.
I guess (as an ex-physicist and long-time software engineer) I’m not really hung up about the fact that emulations are normally performed one timestep at a time, and simulations certainly can be, so I didn’t see much need to make a linguistic distinction for it. But that’s fine, I don’t disagree. Yes, an emulation or (in applicable cases) simulation process will consist of a sequence of many timesteps, and an LLM predicting text similarly does so one token at a time sequentially (which may not, in fact, be the order in which humans produced or consume them, though by default it usually is, something that LLMs often have trouble with, presumably due to their fixed forward-pass computational capacity).
(oh and FYI, Janus is “they”—in the sense of actually being two people: Kyle and Laria)
Suddenly their username makes sense! Thanks, duly noted.