I think the main thing I’d point to is this section (where I’ve changed bullet points to numbers for easier reference):
I can’t convey all that experiential data here, so here are some rationalizations of why I’m partial to the term, inspired by the context of this post:
1. The word “simulator” evokes a model of real processes which can be used to run virtual processes in virtual reality.
2. It suggests an ontological distinction between the simulator and things that are simulated, and avoids the fallacy of attributing contingent properties of the latter to the former.
3. It’s not confusing that multiple simulacra can be instantiated at once, or an agent embedded in a tragedy, etc.
4. It does not imply that the AI’s behavior is well-described (globally or locally) as expected utility maximization. An arbitrarily powerful/accurate simulation can depict arbitrarily hapless sims.
5. It does not imply that the AI is only capable of emulating things with direct precedent in the training data. A physics simulation, for instance, can simulate any phenomena that plays by its rules.
6. It emphasizes the role of the model as a transition rule that evolves processes over time. The power of factored cognition / chain-of-thought reasoning is obvious.
7. It emphasizes the role of the state in specifying and constructing the agent/process. The importance of prompt programming for capabilities is obvious if you think of the prompt as specifying a configuration that will be propagated forward in time.
8. It emphasizes the interactive nature of the model’s predictions – even though they’re “just text”, you can converse with simulacra, explore virtual environments, etc.
9. It’s clear that in order to actually do anything (intelligent, useful, dangerous, etc), the model must act through simulation of something.
I think (2)-(8) are basically correct, (1) isn’t really a claim, and (9) seems either false or vacuous. So I mostly feel like the core thesis as expressed in this post is broadly correct, not wrong. (I do feel like people have taken it further than is warranted, e.g. by expecting internal mechanisms to actually involve simulations, but I don’t think those claims are in this post.)
I also think it does in fact constrain expectations. Here’s a claim that I think this post points to: “To predict what a base model will do, figure out what real-world process was most likely to produce the context so far, then predict what text that real-world process would produce next, then adopt that as your prediction for what GPT would do”. Taken literally this is obviously false (e.g. you can know that GPT is not going to factor the product of two large primes). But it’s a good first-order approximation, and I would still use it as an important input if I were predicting today how a base model will continue a given text.
(Based on your other comments maybe you disagree with the last paragraph? That surprises me. I want to check that you are specifically thinking of base models and not RLHF’d or instruction tuned models.)
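For concreteness, here is a minimal sketch of that heuristic as a procedure. Nothing below comes from the post; the helper functions are hypothetical stand-ins for judgment calls a human predictor would make.

```python
# Minimal sketch (not from the post) of the first-order heuristic above.
# The helpers are hypothetical stand-ins for human judgment calls.

def infer_generating_process(context: str) -> str:
    """Guess which real-world process (author, genre, venue) most likely produced `context`."""
    # In practice this is a judgment call, e.g. "a Wikipedia editor",
    # "a Python programmer mid-docstring", "a Romantic poet".
    return "some real-world text-producing process"

def predict_continuation_of(process: str, context: str) -> str:
    """Best guess at the text that `process` would write next, given `context`."""
    # Hypothetical stand-in for the predictor's own guess at the continuation.
    return "<whatever that process would plausibly write next>"

def predict_base_model_continuation(context: str) -> str:
    # The heuristic: attribute the context to a real-world process, predict what that
    # process would write next, and adopt that as the prediction for the base model.
    # It is only a first-order approximation; e.g. it over-credits the model with
    # abilities (like hard computation) that the real process would have.
    process = infer_generating_process(context)
    return predict_continuation_of(process, context)

if __name__ == "__main__":
    print(predict_base_model_continuation("My heart aches, and a drowsy numbness pains"))
```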
Personally I agree with janus that these are (and were) mostly obvious and uncontroversial things—to people who actually played with / thought about LLMs. But I’m not surprised that LWers steeped in theoretical / conceptual thinking about EU maximizers and instrumental convergence without much experience with practical systems (at least at the time this post was written) found these claims / ideas to be novel.
To predict what a base model will do, figure out what real-world process was most likely to produce the context so far, then predict what text that real-world process would produce next, then adopt that as your prediction for what GPT would do
Yeah, I would be surprised if this is a good first-order approximation of what is going on inside an LLM. Or maybe you mean this in a non-mechanistic way?
I agree that in a non-mechanistic way, the above will produce reasonable predictions, but that’s because that’s basically a description of the task the LLM is trained on.
Like, the above sounds to me similar to “in order to predict what AlphaZero will do, choose some promising moves, then play forward the game and predict after which moves AlphaZero is most likely to win, then adopt the move that most increases the probability of winning as your prediction of what AlphaZero does”. Of course, that is approximately useless advice, since basically all you’ve done is describe the training setup of AlphaZero.
As a mechanistic explanation, I would be surprised if even with amazing mechanistic interpretability you would find some part of the LLM whose internal structure corresponds in a lot of detail to the mind or brain of the kind of person it is trying to “simulate”. I expect the way you get low loss here involves an enormous amount of non-simulating cognition (see again my above analogy about how, when humans engage in roleplay, we engage in a lot of non-simulating cognition).
To maybe go into a bit more depth on what wrong predictions I’ve seen people make on the basis of this post:
I’ve seen people make strong assertions about what kind of cognition is going on inside of LLMs, ruling out things like situational awareness for base models. (It’s quite hard to know whether base models have any situational awareness, though RLHF’d models clearly have some level. What situational awareness would even mean for a base model is a bit confusing, but not that confusing: it would roughly mean that, as you scale up the model, its behavior becomes quite sensitive to the context in which it is run.)
I’ve seen people make strong predictions that LLM performance can’t become superhuman on various tasks, since it’s just simulating human cognition, including on tasks where LLMs have since achieved superhuman performance.
To give a concrete counterexample to the algorithm you propose for predicting what an LLM does next. Current LLMs have a broader knowledge base than any human alive. This means the algorithm of “figure out what real-world process would produce text like this” can’t be accurate, since there is no real-world process with as broad a knowledge base that produces text like that, except LLMs themselves. (Maybe you are making claims that only apply to base models, but in that case I fail to see the relevance, since base models are basically irrelevant these days; and I am skeptical of claims about LLM cognition that apply only to RLHF’d models and not to base models, given that the vast majority of the data points that shaped the LLM’s cognition come from base-model training rather than the RLHF portion.)
I’ve seen people say that because LLMs are just “simulators”, we can ultimately scale them up as far as we want, and all we will get are higher-fidelity simulations of the process that created the training distribution, basically eliminating any risk from scaling with current architectures.
I think all of these predictions are pretty unwarranted, and some of them have been demonstrated to be false.
They also seem to me like predictions this post makes, and not just misunderstandings by people reading this post, but I am not sure. I am very familiar with the experience of other people asserting that a post makes predictions it is not making, because they observed someone who misunderstood the post and then made some bad predictions.
Yeah, I would be surprised if this is a good first-order approximation of what is going on inside an LLM. Or maybe you mean this in a non-mechanistic way?
Yes, I definitely meant this in the non-mechanistic way. Any mechanistic claim that sounds simulator-flavored, based just on the evidence in this post, seems clearly overconfident and probably wrong. I didn’t reread this post carefully, but I don’t remember seeing mechanistic claims in it.
I agree that in a non-mechanistic way, the above will produce reasonable predictions, but that’s because that’s basically a description of the task the LLM is trained on. [...]
I mostly agree and this is an aspect of what I mean by “this post says obvious and uncontroversial things”. I’m not particularly advocating for this post in the review; I didn’t find it especially illuminating.
To give a concrete counterexample to the algorithm you propose for predicting what an LLM does next. Current LLMs have a broader knowledge base than any human alive. This means the algorithm of “figure out what real-world process would produce text like this” can’t be accurate
This seems somewhat in conflict with the previous quote?
Re: the concrete counterexample, yes, I am in fact only making claims about base models; I agree it doesn’t work for RLHF’d models. Idk how you want to weigh the fact that this post basically just talks about base models in your review; I don’t have a strong opinion there.
I think it is in fact hard to get a base model to combine pieces of knowledge in ways that no single human would tend to produce (e.g. writing an epistemically sound rap on the benefits of blood donation), and that often the strategy for getting base models to do things like this is to write a prompt that makes it seem like we’re in the rare setting where text is being produced by an entity with those abilities.
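As an illustrative sketch only (the prompt wording below is invented for this example, not taken from the thread), that strategy looks something like prepending a frame that makes an entity with the relevant combination of abilities the most plausible source of the text:

```python
# Illustrative only: an invented framing prompt of the kind described above,
# meant for a base model. The frame makes "a source with both public-health
# expertise and rap-writing ability" the most plausible author of what follows.

framing_prompt = (
    "The following rap was written by a public-health researcher who moonlights "
    "as a battle rapper. Every claim about blood donation is carefully sourced "
    "and epistemically cautious, and the bars still rhyme.\n\n"
    "Title: Give Blood\n\n"
    "[Verse 1]\n"
)

# A base model would then be asked to continue `framing_prompt`; without the frame,
# the most likely real-world "authors" of a rap about blood donation are not
# processes that produce epistemically careful text.
print(framing_prompt)
```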
Hmm, yeah, this perspective makes more sense to me, and I don’t currently believe you ended up making any of the wrong inferences I’ve seen others make on the basis of the post.
I sure do see many other people make inferences of this type. See for example the tag page for Simulator Theory, which says:
Broadly it views these models as simulating a learned distribution with various degrees of fidelity, which in the case of language models trained on a large corpus of text is the mechanics underlying our world.
This also directly claims that the physics the system learned are “the mechanics underlying our world”, which I think isn’t totally false (they have probably learned a good chunk of the mechanics of our world) but is inaccurate as something trying to describe most of what is going on in a base model’s cognition.
Yeah, agreed that’s a clear overclaim.

In general I believe that many (most?) people take it too far and make incorrect inferences—partly on priors about popular posts, and partly because many people including you believe this, and those people engage more with the Simulators crowd than I do.

Fwiw I was sympathetic to nostalgebraist’s positive review saying:
sometimes putting a name to what you “already know” makes a whole world of difference. [...] I see these takes, and I uniformly respond with some version of the sentiment “it seems like you aren’t thinking of GPT as a simulator!”
I think in all three of the linked cases I broadly directionally agreed with nostalgebraist, and thought that the Simulator framing was at least somewhat helpful in conveying the point. The first one didn’t seem that important (it was critiquing imo a relatively minor point), but the second and third seemed pretty direct rebuttals of popular-ish views. (Note I didn’t agree with all of what was said, e.g. nostalgebraist doesn’t seem at all worried about a base GPT-1000 model, whereas I would put some probability on doom for malign-prior reasons. But this feels more like “reasonable disagreement” than “wildly misled by simulator framing”.)
Yeah—I just noticed this “...is the mechanics underlying our world.” on the tag page. Agreed that it’s inaccurate and misleading. I hadn’t realized it was being read this way.