It is exactly because of the existence of GPT, the predictive model, that sampling from GPT is considered simulation; I don’t think there’s any real tension in the ontology here.
EY gave a tension, or at least a way in which viewing Simulators as a semantic primitive, rather than as an approximate consequence of a predictive model, is misleading. I’ll try to give it again from another angle.
To give the sort of claim worth objecting to (one I think is an easy trap to get caught in, even though I don’t think the original Simulators post was confused), here is a quote from that post: “GPT doesn’t seem to care which agent it simulates, nor if the scene ends and the agent is effectively destroyed.” The idea, in other words, is that a GPT rollout is a stochastic sample from a text-generating source, or possibly from a set of them in superposition.
Consider again the task of predicting first a cryptographic hash and then the text which hashes to it, or more generally the class of algorithms for which the forward pass (hashing) is tractable for the network and the backward pass (breaking the hash) is not, of which predicting cryptographic hashes is a limiting case.
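For concreteness, here is a toy version of that task (my construction; SHA-256 and random hex strings stand in for “text”): the training document puts the digest before its preimage, so a left-to-right predictor has to emit the hash before it has seen the text that determines it.

```python
import hashlib
import secrets

def make_document() -> str:
    """Build one toy 'hash, then preimage' training document."""
    text = secrets.token_hex(8)  # high-entropy stand-in for the text being hashed
    digest = hashlib.sha256(text.encode()).hexdigest()
    # The digest comes FIRST, so an autoregressive predictor must produce it
    # before conditioning on the preimage that determines it.
    return f"{digest} {text}"

print(make_document())
```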
If a model rollout were primarily trying to be a superposition of one or more coherent simulations, there is a computationally tractable approach to this task: internally sample a set of candidate phrases, compute their hashes, sample the output hash from that set (narrowing the candidates as each hash token is emitted), and then output the corresponding preimage text.
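A minimal sketch of that simulator-style strategy (mine, not a claim about any real model’s internals; the narrowing step is collapsed into a single choice): commit to a preimage first, hash it, and only then emit the hash followed by the matching text, so the two are consistent by construction.

```python
import hashlib
import random
import secrets

def simulator_style_rollout(n_candidates: int = 4) -> str:
    """Sample candidate phrases internally, commit to one, emit its hash, then the phrase."""
    candidates = [secrets.token_hex(8) for _ in range(n_candidates)]
    chosen = random.choice(candidates)  # the 'simulation' this rollout commits to
    digest = hashlib.sha256(chosen.encode()).hexdigest()
    return f"{digest} {chosen}"  # hash and text agree, unlike a left-to-right guess

print(simulator_style_rollout())
```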
Instead, a GPT model will produce a random, semantically meaningless hash and then sample unrelated text. Even if seeded from the algorithm above, backprop will select away from the superposition and towards the distributional, predictive model. This holds even in the case where the GPT has an entropy source that would allow it to be distributionally perfect when rolled out from the start! Backprop will still say no: your goal is prediction, not simulation. As EY says, this is not a GAN.
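A toy rendering of why the training dynamics point that way (my construction; the uniform stand-in model and names are hypothetical): the loss is computed under teacher forcing against real documents, so per-token predictive accuracy is all that is ever graded, and a rollout’s internal hash–text consistency never enters the objective.

```python
import math

def uniform_model_probs(prefix, token, vocab_size=64):
    """Hypothetical stand-in model: the per-token base rate over a 64-symbol alphabet."""
    return 1.0 / vocab_size

def teacher_forced_loss(model_probs, document_tokens):
    """Negative log-likelihood under teacher forcing: every step conditions on the
    TRUE prefix from the document, never on the model's own sampled continuation."""
    return sum(-math.log(model_probs(document_tokens[:t], tok))
               for t, tok in enumerate(document_tokens))

print(teacher_forced_loss(uniform_model_probs, list("a3f9c1")))
```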
Again, I don’t think the original Simulators post was necessarily confused about any of this, but I also agree with this post that the terminology is imprecise and the differences can be important.
I can see why your algorithm is hard for GPT (unless it predicts the follow-up string perfectly, there’s no benefit to hashing correctly), but I don’t see why it’s impossible. What if it perfectly predicts the follow-up?
This is by construction: I am choosing a task for which one direction is tractable and the other is not. The existence of such tasks follows from standard cryptographic arguments; the specifics of the limiting case are less relevant.
If you want to extrapolate to models strong enough to break SHA-256, you have already conceded EY’s point, since that is a superhuman task, at least relative to the generators of the training data. Even then, there will still exist similar tasks of equal or slightly longer length, possibly using a different hashing scheme, for which the argument holds again by the same basic cryptographic considerations.
Note that this argument requires the text to have sufficiently high entropy for the hash to not be predictable a priori.
It’s the final claim I’m disputing: that the hashed text cannot itself be predicted. There’s still a benefit to going from, e.g., a $10^{-20}$ to a $10^{-10}$ probability of a correct hash. It may not be a meaningful difference in practice, but there’s still a benefit in principle, and in practice it could also just generalise a strategy it learned for cases with low-entropy text.
The mathematical counterpoint is that this again only holds for sufficiently low-entropy completions, which need not be the case, and if you want to push this argument all the way to computronium suns, you run into other issues before a reasonably defined problem statement does.
The practical counterpoint is that, from the perspective of a simulator graded by simulation success, such an improvement might be marginally selected for, because $\epsilon$ is bigger than zero; but from the perspective of the actual predictive training dynamics, a policy with a success rate that low is ruthlessly selected against, and the policy of outputting the per-token base rate for the hash dominates, because $\epsilon$ is smaller than $1/64$.
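A rough formalisation of that comparison (my numbers; assume hash characters are drawn from a 64-symbol alphabet and the model’s internal guess matches the true character with probability $\epsilon$): if the policy puts mass $q$ on its guessed character and spreads the rest uniformly over the other 63, its expected per-token loss is
$$L(q) = -\,\epsilon \log q \;-\; (1-\epsilon)\log\frac{1-q}{63},$$
which is minimised at $q = \epsilon$. So whenever $\epsilon < 1/64$, the loss-minimising policy puts even less mass on the guess than the uniform base rate would, and trusting the internal simulation is selected against.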
Are hash characters non-uniform? Then I’d agree my point doesn’t stand.
They typically are uniform, but this doesn’t feel like the most useful place to be arguing minutiae, unless you have a cruxy point underneath that I’m not spotting. “The training process for LLMs can optimize for distributional correctness at the expense of sample plausibility, and is functionally different from processes like GANs in this regard” is a clarification with empirically relevant stakes, but I don’t know what the stakes are for this digression.
I was just trying to clarify the limits of autoregressive vs other learning methods. Autoregressive learning is at an apparent disadvantage if $P(X_t \mid X_{t-1})$ is hard to compute and the reverse is easy and low entropy. It can “make up for this” somewhat if it can do a good job of predicting $X_t$ from $X_{t-2}$, but it’s still at a disadvantage if, for example, that’s relatively high entropy compared to predicting $X_{t-1}$ from $X_t$. That’s it, I’m satisfied.