This post snuck up on me.

The first time I read it, I was underwhelmed. My reaction was: “well, yeah, duh. Isn’t this all kind of obvious if you’ve worked with GPTs? I guess it’s nice that someone wrote it down, in case anyone doesn’t already know this stuff, but it’s not going to shift my own thinking.”
But sometimes putting a name to what you “already know” makes a whole world of difference.
Before I read “Simulators,” when I’d encounter people who thought of GPT as an agent trying to maximize something, or people who treated MMLU-like one-forward-pass inference as the basic thing that GPT “does” … well, I would immediately think “that doesn’t sound right,” and sometimes I would go on to think about why, and concoct some kind of argument.
But it didn’t feel like I had a crisp sense of what mistake(s) these people were making, even though I “already knew” all the low-level stuff that led me to conclude that some mistake was being made—the same low-level facts that Janus marshals here for the same purpose.
It just felt like I lived in a world where lots of different people said lots of different things about GPTs, and a lot of these things just “felt wrong,” and these feelings-of-wrongness could be (individually, laboriously) converted into arguments against specific GPT-opiners on specific occasions.
Now I can just say “it seems like you aren’t thinking of GPT as a simulator!” (Possibly followed by “oh, have you read Simulators?”) One size fits all: this remark unifies my objections to a bunch of different “wrong-feeling” claims about GPTs, which would earlier have seemed wholly unrelated to one another.
This seems like a valuable improvement in the discourse.
And of course, it affected my own thinking as well. You think faster when you have a name for something; you can do in one mental step what used to take many steps, because a frequently handy series of steps has been collapsed into a single, trusted word that stands in for them.
Given how much this post has been read and discussed, it surprises me how often I still see the same mistakes getting made.
I’m not talking about people who’ve read the post and disagree with it; that’s fine and healthy and good (and, more to the point, unsurprising).
I’m talking about something else—that the discourse seems to be in a weird transitional state, where people have read this post and even appear to agree with it, but go on casually treating GPTs as vaguely humanlike and psychologically coherent “AIs” which might be Buddhist or racist or power-seeking, or as baby versions of agent-foundations-style argmaxxers which haven’t quite gotten to the argmax part yet, or as alien creatures which “pretend to be” (??) the other creatures which their sampled texts are about, or whatever.
All while paying too little attention to the vast range of possible simulacra, e.g. by playing fast and loose with the distinction between “all simulacra this model can simulate” and “how this model responds to a particular prompt” and “what behaviors a reward model scores highly when this model does them.”
I see these takes, and I uniformly respond with some version of the sentiment “it seems like you aren’t thinking of GPT as a simulator!” And people always seem to agree with me, when I say this, and give me lots of upvotes and stuff. But this leaves me confused about how I ended up in a situation where I felt like making the comment in the first place.
It feels like I’m arbitraging some mispriced assets, and every time I do it I make money and people are like “dude, nice trade!”, but somehow no one else thinks to make the same trade themselves, and the prices stay where they are.
Scott Alexander expressed a similar sentiment in Feb 2023:
I don’t think AI safety has fully absorbed the lesson from Simulators: the first powerful AIs might be simulators with goal functions very different from the typical Bostromian agent. They might act in humanlike ways. They might do alignment research for us, if we ask nicely. I don’t know what alignment research aimed at these AIs would look like and people are going to have to invent a whole new paradigm for it. But also, these AIs will have human-like failure modes. If you give them access to a gun, they will shoot people, not as part of a 20-dimensional chess strategy that inevitably ends in world conquest, but because they’re buggy, or even angry.
That last sentence resonates. Next-generation GPTs will be potentially dangerous, if nothing else because they’ll be very good imitators of humans (+ in possession of a huge collection of knowledge/etc. that no individual human has), and humans can be quite dangerous.
A lot of current alignment discussion (esp. deceptive alignment stuff) feels to me like an increasingly desperate series of attempts to say “here’s how 20-dimensional chess strategies that inevitably end in world conquest can still win[1]!” As if people are flinching away from the increasingly plausible notion that AI will simply do bad things for recognizable, human reasons; as if the injunction to not anthropomorphize the AI has been taken so much to heart that people are unable to recognize actually, meaningfully anthropomorphic AIs—AIs for which the hypothesis “this is like a human” keeps making the right prediction, over and over—even when those AIs are staring them right in the face.[2]
Which is to say, I think AI safety still has not fully absorbed the lesson from Simulators, and I think this matters.
One quibble I do have with this post—it uses a lot of LW jargon, and links to Sequences posts, and stuff like that. Most of this seems extraneous or unnecessary to me, while potentially limiting the range of its audience.
(I know of one case where I recommended the post to someone and they initially bounced off it because of this “aggressively rationalist” style, only to come back and read the whole thing later, and then be glad they had. A near miss.)
[1] I.e. can still be important alignment failure modes. But I couldn’t resist the meme phrasing.

[2] By “AIs” in this paragraph, I of course mean simulacra, not simulators.