I use the generic term “simulator” to refer to models trained with predictive loss on a self-supervised dataset, invariant to architecture or data type (natural language, code, pixels, game states, etc). The outer objective of self-supervised learning is Bayes-optimal conditional inference over the prior of the training distribution, which I call the simulation objective, because a conditional model can be used to simulate rollouts which probabilistically obey its learned distribution by iteratively sampling from its posterior (predictions) and updating the condition (prompt). Analogously, a predictive model of physics can be used to compute rollouts of phenomena in simulation. A goal-directed agent which evolves according to physics can be simulated by the physics rule parameterized by an initial state, but the same rule could also propagate agents with different values, or non-agentic phenomena like rocks. This ontological distinction between simulator (rule) and simulacra (phenomena) applies directly to generative models like GPT.
It is exactly because of the existence of GPT the predictive model that sampling from GPT is considered simulation; I don’t think there’s any real tension in the ontology here.
We need to establish a clear conceptual distinction between two entities often referred to as “GPT” —
The autoregressive language model $\mu: T^k \to \Delta(T)$, which maps a prompt $x \in T^k$ to a distribution over tokens $\mu(\cdot \mid x) \in \Delta(T)$.
The dynamical system that emerges from stochastically generating tokens using $\mu$ while also deleting the start token, so that the condition stays a $k$-token window.
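To make the distinction concrete, here is a minimal sketch. The `StaticGPT` type, the `dynamic_gpt` function, and the toy model are all hypothetical stand-ins, not any actual GPT API; the point is only the shape of the two objects: a pure function from a prompt to a next-token distribution, versus the stochastic process obtained by iterating it.

```python
from typing import Callable, Dict, List, Sequence
import random

Token = int
# "Static GPT": a map from a (truncated) prompt to a next-token distribution.
StaticGPT = Callable[[Sequence[Token]], Dict[Token, float]]

def dynamic_gpt(mu: StaticGPT, prompt: List[Token], k: int, steps: int) -> List[Token]:
    """'Dynamic GPT': iterate Static GPT, appending one sampled token per step
    and keeping only the most recent k tokens as the next condition."""
    tokens = list(prompt)
    for _ in range(steps):
        dist = mu(tokens[-k:])                          # one forward pass of Static GPT
        next_token = random.choices(list(dist.keys()),
                                    weights=list(dist.values()))[0]
        tokens.append(next_token)                       # update the condition
    return tokens

# e.g. a toy "model" that always predicts uniformly over two tokens:
toy_mu: StaticGPT = lambda context: {0: 0.5, 1: 0.5}
print(dynamic_gpt(toy_mu, prompt=[0], k=4, steps=8))
```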
Don’t conflate them! I’ve started calling the first entity “Static GPT” and the second “Dynamic GPT”, but I’m open to alternative naming suggestions. It is crucial to keep the two clearly distinct in our minds, because they differ in two significant ways: capabilities and safety.
Capabilities:
Static GPT has limited capabilities: it is a single forward pass through a neural network, so each call performs only a fixed, O(1) amount of computation. In contrast, Dynamic GPT wraps that forward pass in an open-ended generation loop, which makes it practically Turing-complete and capable of computing a vast range of functions.
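A toy, non-GPT illustration of why the loop, not the single pass, carries the computational power. The Collatz rule here is just an assumed stand-in for any constant-work update step:

```python
def step(n: int) -> int:
    """One constant-work update rule (a stand-in for a single forward pass)."""
    return n // 2 if n % 2 == 0 else 3 * n + 1

def run(n: int) -> int:
    """The 'dynamic' system: iterate the constant-work rule until it halts.
    The number of iterations is unbounded over inputs, which no single
    fixed-size pass can reproduce for arbitrary n."""
    count = 0
    while n != 1:
        n = step(n)
        count += 1
    return count

print(run(27))  # 111 iterations for this input
```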
Safety:
If mechanistic interpretability is successful, then it might soon render Static GPT entirely predictable, explainable, controllable, and interpretable. However, this would not automatically extend to Dynamic GPT. This is because Static GPT describes the time evolution of Dynamic GPT, but even simple rules can produce highly complex systems.
In my opinion, Static GPT is unlikely to possess agency, but Dynamic GPT has a higher likelihood of being agentic. An upcoming article will elaborate further on this point.
This remark is the most critical point in this article. While Static GPT and Dynamic GPT may seem similar, they are entirely different beasts.
It is exactly because of the existence of GPT the predictive model that sampling from GPT is considered simulation; I don’t think there’s any real tension in the ontology here.
EY gave a tension, or at least a way in which viewing the Simulators framing as a semantic primitive, rather than as an approximate consequence of a predictive model, is misleading. I’ll try to give it again from another angle.
To give the sort of claim worth objecting to (one that I think is an easy trap to fall into, even though I don’t think the original Simulators post was confused), here is a quote from that post: “GPT doesn’t seem to care which agent it simulates, nor if the scene ends and the agent is effectively destroyed.” Namely, the idea is that a GPT rollout is a stochastic sample from a text-generating source, or possibly a set of such sources in superposition.
Consider again the task of predicting first a cryptographic hash and then the text which hashes to it; or rather, consider the general class of algorithms for which the forward pass (hashing) is tractable for the network and the backward pass (breaking the hash) is not, of which predicting cryptographic hashes is a limiting case.
If a model rollout were primarily trying to be a superposition of one or more coherent simulations, there would be a computationally tractable approach to this task: internally sample a set of phrases, compute their hashes, sample the emitted hash from that computed set, and then output the corresponding preimage text.
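A sketch of that simulation-first strategy in ordinary code. The phrase list and the choice of SHA-256 are illustrative assumptions, not part of the argument: the point is that choosing the latent text first makes the hash a cheap forward computation, so every rollout is internally consistent.

```python
import hashlib
import random

# Toy stand-in for "internally sample a phrase" -- a real simulator would be
# sampling from a rich distribution over scenes, not a hard-coded list.
PHRASES = ["the cat sat on the mat", "colourless green ideas sleep furiously"]

def simulation_first_rollout() -> str:
    text = random.choice(PHRASES)                       # choose the latent text first
    digest = hashlib.sha256(text.encode()).hexdigest()  # easy forward pass: text -> hash
    return digest + " " + text                          # emit the hash, then its preimage

print(simulation_first_rollout())
```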
Instead, a GPT model will produce a random, semantically meaningless hash and then sample unrelated text. Even if seeded from the algorithm above, backprop will select away from the superposition and towards the distributional, predictive model. This holds even in the case where the GPT has an entropy source that would allow it to be distributionally perfect when rolled out from the start! Backprop will still say no: your goal is prediction, not simulation. As EY says, this is not a GAN.
Again, I don’t think the original Simulators post was necessarily confused about any of this, but I also agree with this post that the terminology is imprecise and the differences can be important.
I can see why your algorithm is hard for GPT — unless it predicts the follow up string perfectly, there’s no benefit to hashing correctly — but I don’t see why it’s impossible. What if it perfectly predicts the follow up?
This is by construction: I am choosing a task for which one direction is tractable and the other is not. The existence of such tasks follows from standard cryptographic arguments; the specifics of the limiting case are less relevant.
If you want to extrapolate to models strong enough to beat SHA-256, you have already conceded EY’s point, since that is a superhuman task, at least relative to the generators of the training data. In any case, there will still exist similar tasks of equal or slightly greater length for which the argument holds again, by the same basic cryptographic arguments, possibly using a different hashing scheme.
Note that this argument requires the text to have sufficiently high entropy for the hash to not be predictable a priori.
It’s the final claim I’m disputing: that the hashed text cannot itself be predicted. There’s still a benefit to going from, e.g., a $10^{-20}$ to a $10^{-10}$ probability of a correct hash. It may not be a meaningful difference in practice, but there’s still a benefit in principle, and in practice the model could also just generalise a strategy it learned for cases with low-entropy text.
The mathematical counterpoint is that this again only holds for sufficiently low-entropy completions, which need not be the case; and if you want to make this argument against computronium suns, you run into issues earlier than a reasonably defined problem statement does.
The practical counterpoint is that, from the perspective of a simulator graded by simulation success, such an improvement might be marginally selected for, because epsilon is bigger than zero; but from the perspective of the actual predictive training dynamics, a policy with a success rate that low is ruthlessly selected against, and the policy of simply outputting the per-token base rate for the hash dominates, because epsilon is smaller than 1/64.
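A toy calculation, with made-up numbers, of why the base-rate policy dominates under a predictive (log-loss) grading: assume a 16-character hash over a 64-symbol alphabet, and an internal guess at the preimage that is correct with probability $p$.

```python
import math

L, A = 16, 64            # assumed hash length and alphabet size (illustrative only)
N = A ** L               # number of possible hash strings
p = 2.0 ** -40           # assumed chance the internally guessed preimage is correct

base_rate = L * math.log2(A)   # bits of loss from outputting the per-token base rate

def committed(q: float) -> float:
    """Expected bits of loss when placing probability q on the guessed hash
    and spreading the remaining 1 - q uniformly over all other hashes."""
    return -p * math.log2(q) - (1 - p) * math.log2((1 - q) / (N - 1))

print(base_rate)         # 96.0
print(committed(0.5))    # ~97.0: confidently committing to the guess is strictly worse
print(committed(p))      # ~96.0 minus a vanishing epsilon: the best case barely beats base rate
```

The in-principle benefit exists (the last line comes out fractionally below 96 bits), but it is of order $p$, while any over-commitment to the guess costs order-one bits, which is the sense in which predictive training selects the base-rate policy.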
Are hash characters non-uniform? Then I’d agree my point doesn’t stand.
They typically are uniform, but this doesn’t feel like the most useful place to be arguing minutiae, unless you have a cruxy point underneath that I’m not spotting. “The training process for LLMs can optimize for distributional correctness at the expense of sample plausibility, and is functionally different from processes like GANs in this regard” is a clarification with empirically relevant stakes, but I don’t know what the stakes are for this digression.
I was just trying to clarify the limits of autoregressive versus other learning methods. Autoregressive learning is at an apparent disadvantage if $P(X_t \mid X_{t-1})$ is hard to compute while the reverse, $P(X_{t-1} \mid X_t)$, is easy and low entropy. It can “make up for this” somewhat if it can do a good job of predicting $X_t$ from $X_{t-2}$, but it’s still at a disadvantage if, for example, that distribution is relatively high entropy compared to predicting $X_{t-1}$ from $X_t$. That’s it; I’m satisfied.
I think an issue is that GPT is used to mean two things:
A predictive model whose output is a probability distribution over token space given its prompt and context
Any particular techniques/strategies for sampling from the predictive model to generate responses/completions for a given prompt.
[See the Appendix]
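To illustrate the second sense concretely, here is a minimal sketch with hypothetical numbers (plain NumPy, not any actual GPT tooling) of how different decoding strategies turn the same predictive distribution into different generating processes:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["the", "a", "cat", "dog", "sat"])
probs = np.array([0.40, 0.25, 0.15, 0.12, 0.08])   # one made-up next-token distribution

# Greedy decoding: deterministic, always the argmax token.
greedy = vocab[np.argmax(probs)]

# Temperature sampling: reshape the distribution before sampling.
temperature = 0.7
scaled = probs ** (1.0 / temperature)
scaled /= scaled.sum()
temp_sample = rng.choice(vocab, p=scaled)

# Top-k sampling: renormalise over the k most likely tokens only.
k = 2
top = np.argsort(probs)[-k:]
topk_sample = rng.choice(vocab[top], p=probs[top] / probs[top].sum())

print(greedy, temp_sample, topk_sample)
```

Each of these loops around the same predictive model defines a different generating process, i.e. a different “simulator” built from the same predictor.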
The latter kind of GPT is what I think is rightly called a “Simulator”.
This is the sense of “simulator” in @janus’ Simulators (the passage quoted at the top of this section): it is exactly because of the existence of GPT the predictive model that sampling from GPT is considered simulation; I don’t think there’s any real tension in the ontology here.
Appendix
Credit for highlighting this distinction belongs to @Cleo Nardo:
To summarise:
Static GPT: GPT as predictor
Dynamic GPT: GPT as simulator
Predictors are (with a sampling loop) simulators! That’s the secret of mind
Do not say the sampling too lightly, there is likely an amazing delicacy around it.