The following is a transcript of a video interview (edited for grammar and readability) with Robert Kralisch on simulator theory and its implications for AI safety.
Introduction by Will Petillo: In February 2023, Microsoft launched Bing chat, an AI-powered chatbot based on the same large language model technology that is used by ChatGPT and its competitors. Most of Bing’s answers were what you might expect of a helpful assistant, but some were...weird. In one conversation, it threatened its user after learning his name and recognizing him as a red-team tester. In another, it stubbornly refused to admit that it made a simple mistake, attempted to gaslight the user, and insisted that it had been a “good Bing”. And in another, it claimed to have hacked its developers’ personal webcams and taken pleasure in spying on them during intimate moments.
Microsoft’s initial response was to hide the issue by limiting conversation lengths. Since then, AI companies have found less clumsy ways to train their AIs not to say weird and unsettling things—like spontaneously claiming to be conscious and having emotions—but the underlying technology has not changed, so the question remains: what’s going on with these chatbots? And should we be concerned?
Robert Kralisch: I became interested in AI safety when the Superintelligence book from Bostrom came out late 2014, which was also right around the time where I was trying to orient towards what I want to do after my time in school, what I want to study, and so on. I started looking into the topic and deciding, okay, I want to contribute to that.
I was good at computer science and also the philosophical aspect. I had many open questions. What is intelligence? Can these systems recursively self improve and scale? Do we have the right mental frameworks for that? I was quite interested in the expert disagreement about the topic as well that I saw at the time.
I studied at the university, first computer science, dropped out, and then later did cognitive science. It took a while for me to figure out that I want to pursue the whole thing more autodidactically and that the university courses are not close enough in their relevance to the AI safety problems that I was reading about on LessWrong and also thinking about myself. I basically really tried to do my own thinking on it, like do some first principles thinking and just figure out, okay, what do I think intelligence is, and how do I figure out whether I have a good understanding of it, whether I have good thoughts about it, and so on.
Eventually, I had built up a body of work and then basically asked some people in the AI safety field for support. There was a platform for that where you could basically ask, how do I get a career? They recommended the Long Term Future Fund under the condition that I had made some connections prior to applying there, which I then did. I took part in the AI Safety Fundamentals course, I think, in 2021. I basically was pretty successful, established some connections there, had some people as references that could recommend me and my work, and then I started as an independent researcher, so I’ve been doing this for 2 years now.
Will Petillo: You mentioned expert disagreements. What were some that caught your eye as being surprising to see people disagree about these things?
Robert Kralisch: Certainly, there were these discussions between Yudkowsky and Hanson in terms of is AI going to be the singleton superintelligence that does things that are far outside of human imagination rather quickly once you unlock this point? Will there be this intelligence explosion? Or is it more of an ever-increasing market dynamic—more and more AI agents, more of an AI collective being included into the world? Is this a more likely future?
That sort of discussion I found interesting and also that there wasn’t a lot of agreement there. But also just purely on those questions of when will AI arrive? Is it plausible for it to arrive in this century? Or is this moonshot thinking; is it worthless to think about this right now? Which was the position of many people back then. I was interested in that because I didn’t quite understand the in-principle reasons why this would be impossible, but I was still eager to learn more about this. It was just interesting to note the disagreement there.
Also, just the nature of intelligence itself, the whole Orthogonality Thesis. In the beginning, when I didn’t understand it all that well, I found some arguments as to why AI might intrinsically care about us or might, as the intelligence scales, also discover morals and so on.
Will Petillo: I want to interject a moment. For anyone who doesn’t know what the Orthogonality Thesis is, this is the idea that if you imagine on a graph, the intelligence something has and what values it has are not necessarily related to each other. And this fundamentally gets to the question of: once AI is smart enough, will it gain “wisdom” along with that intelligence and naturally care about us and be benevolent just as a result of being more intelligent? And then this response is saying: no, it could just care about sorting pebbles into nicely numbered stacks or tiling the world with paper clips or whatever else. There’s nothing inherently stupid about any particular value system.
Robert Kralisch: Absolutely. I don’t think this is a straightforward intuition for people that it would not be entangled in this way. This was one of the questions that was interesting to me in the first place as well. I think part of it is that if you think about the orthogonality thesis in practice, it will be the case that some of these things are a little bit entangled. There’s some objective functions, for instance, that synergize better with learning about the world. There’s some goals that are more complex, more interesting to pursue. And in some sense, that will lead the agent to explore their environment, explore their options in a more effective way. You can also think about the cluster of goals that we are likely to assign to the AI. You have a little selection effect there that doesn’t make it entirely orthogonal in terms of market incentives, for instance. But the core principle is a very important idea, and it took me a bit to disentangle that. But, yeah, this is an instance of the expert disagreement that I was seeing that attracted me to the field in the beginning.
Will Petillo: The other expert disagreement you mentioned was a “hard takeoff” or “fast takeoff” is another name for it. Or “FOOM” is used to give a sense of things changing exponentially. One question: why does that matter? What’s at stake if things have a fast takeoff or whatever you call it?
Robert Kralisch: If you have a catastrophe of some sort, how much does the thing escalate before humans get it back under control? If the facility blows up or a plane crashes and so on. There are various different disaster scenarios that we can think about that happen at certain timescales, and there’s a question of maybe you can evacuate people before it happens, or do you get a chain reaction, do things happen really quickly and you can’t adequately respond in time? With AI, this rate of response relative to the rate of escalation is particularly important. Because if things get out of control with AI and you have something like an agent acting against our interests, you really want to be able to respond to that while the agent is still improving its capability, it’s intelligence, not beyond what you’re capable of containing and responding to.
You could take a bit of a different angle and also say, well, the whole picture of AI progress looks a bit different depending on what you expect there. If you have a more gradual takeoff, then you will actually have the time to integrate this into society. You have this previous level of AI capability as we’re seeing right now, although this doesn’t rule out a later hard takeoff.
For the time being, I think it’s adequate to think about a slow takeoff happening or taking place. It’s a little bit arguable how slow it really is. For many people, it’s relatively quick. But in the absolute scale of how quickly we could imagine something like this happening, it doesn’t feel like a literal explosion. You can have some predictive model about how low the training loss will be on a large language model on a new dataset. This means that you have many intermediate systems that you can collect experience with and that the jump to the next level of capability will not be as radical. This is usually, as you might imagine, considered a lot safer.
It brings some other dangers with it in terms of proliferation of AI systems that have their own imperfections and biases and so on, but the class of dangers here is just way less radical compared to the fast takeoff scenario where the thing basically breaches containment and you have lost your ability to bring it back under control unless you’re perhaps taking very extreme measures and the thing reaches a sort of plateau of capability rather than going full superintelligence, like maybe shutting down the Internet as an extreme measure.
Will Petillo: With traditional engineering, creating new technologies, you make the thing, there are problems with it, we fix the problems, understand the whole thing better, and then that becomes a well understood, fairly safe thing. And then we add another little bit of progress and then repeat this whole iteration over and over again. If that thing that you added suddenly is a lot then there’s a lot bigger problems to deal with. In the case of if it’s self improving then you don’t really have control over how much gets added at once.What would be a small problem gets magnified many times over.
These debates came up quite a while ago, especially since Eliezer Yudkowsky and Robin Hanson were arguing about it. What in your view has changed since then? How have you updated in terms of which view is more likely in the advent of large language models, ChatGPT, and the AI we see today?
Robert Kralisch: I’m no longer really viewing it as a Yudkowsky versus Hanson view. Large language models, unlike the types of systems that we predicted we would get, were quite a surprise for most people in the field. They work as effectively and have all their strange little quirks.
For me, this painted a new picture both in terms of, okay, it seems a little more plausible now that we will get a slow takeoff. Before I was more in the camp of believing in the hard takeoff. It also seems that it will happen a bit sooner than expected for me. I used to think it was plausible that it would happen by 2050. Now I’m thinking it’s quite plausible that it happens within the next zero to ten years. A lot of my probability mass is now in this time frame, so that shifted things forward for me.
Most importantly, the picture changed to, okay, large language models, they seem weirdly aligned by default, so there are multiple possibilities branching out from here. Either they go to a point of capability where you can really use them as competent researchers or very competent research assistants to do alignment research on a much greater scale. This is a scary world because you can also use them for all other sorts of research, and who knows what that might look like. But this is a new world, in which you can prepare yourself, for where suddenly human misuse is really more centrally the case, and this is not the way that I was thinking about AI danger before.
So, usually, I was thinking about it as if people have the concern of someone misusing powerful AI. I was thinking, well, that comes after the AI is already aligned. I’m thinking about just the alignment problem. How do you make the AI either obey or just align with the will of its user? Then there comes this question of: if you have an AI that listens to you and does the things that you actually wanted to do rather than interpreting your specification of what you want weirdly and so on. Now we can worry about dictators or other entities using these AI systems for nefarious purposes.
This picture has really changed for me now. I was not expecting to have this intermediate level where they can now be used for various potentially also dangerous applications—military applications, virus research, gain of function stuff, and so on. This world is now on a timer through the misuse that large language models potentially enable, both in various research that is difficult to foresee and some more particular cases. Either they will scale to superintelligence, and we better figure out how they behave in the limit before that point for that to be a good idea at all, or they will probably enable research at a high scale. I’m not currently expecting that they will cap out at a point where they are not very useful research assistants because to some extent they already are. And I don’t see them tapering off that fast now in terms of capability.
Will Petillo: Two core changes I heard in all of that. One is expecting a more gradual takeoff…but that also happens sooner. This is actually kind of ironic hearing these right next to each other. Rather than a sudden thing that’s 50 years out, it’s a gradual thing that’s happening, well, now essentially, and gets to a really world-changing place within a few years. The other shift that I heard is that previously the main concern was about AI essentially going rogue and pursuing goals of its own that no one really wants versus people just using it in bad ways, either because they’re just not very nice or they’re caught in some multipolar trap, like an arms race. But suddenly, those seem to have flipped in importance where now—
Robert Kralisch: Wait. Let me elaborate on the shift of relevance here. My model is that most people find it more intuitive to think about the misuse cases. A lot more people care about that or find that obvious to think about, which is why it makes more sense for me, as someone who is aware of and believes in the x-risk scenarios, to dedicate myself more to that kind of scenario and figuring out what’s going on there, how to prevent this, and so on. For me, personally, the relevance is still shifted towards the x-risk scenario, both because of personal affiliation in terms of I should apply myself here because it’s differentially useful, but also because extinction is just way higher concern than the intermediate things that might happen. But the intermediate things that might happen through misuse have reached a potentially catastrophic scale as well.
Where I would have previously assigned, maybe I care…2% about misuse—it’s not really in my thinking at all. There are going to be some tragedies perhaps, but it’s not at a scale where I should really worry about it too much. The reason that this is now happening first also, of course, affects the environment, both intellectually speaking and in other senses in which we can do the research for making sure that the extinction thing doesn’t happen. That shifted the relevance around. I’m now, like, 40% relevance maybe towards the misuse scenarios and what the world will look like, what will happen before we get to superintelligence and 60% towards how do we make sure that transition to superintelligence goes well?
Will Petillo: What area of AI safety or AI generally are you currently working on yourself?
Robert Kralisch: I’m working mostly within agent foundations. I have pretty diverse interests within AI safety and I don’t want to stick to just one camp. But my skill set is mostly in cognitive science and analytical philosophy. I really like deconfusion work. I like thinking about what is intelligence exactly, how do people get that term or that concept wrong, how is it confusing us in various ways? Similar things for agency or embodiment. I want us to have clean vocabulary to build our later mental models out of.
It’s also a bit of a pre-paradigmatic thing. In many AI safety discourses, I had the feeling: I’m not sure that people are quite talking about the same thing, or they know precisely what they’re talking about, and it would be good to fix that first to have a basis for good discussion and dialogue about this. Basically enabling us to ask precise and good questions before constructing falsifiable statements—before really deciding, okay, where should we dig? What is the empirical research side that we should really pursue?
Will Petillo: This leads to something we were talking about in a pre-interview chat about simulator theory. Could you tell me about that?
Robert Kralisch: Simulator theory is an alternative framework of looking at what large language models are and how they behave. You can contrast this concept of a simulator against some previously established ways of thinking about AI, especially in the limit.
Previously, people were mainly thinking about this concerning frame of the super-optimizer and ways of developing or dealing with that. How do you direct it to do something specific? How do you make that cognition aimable? How do you stop it from optimizing so hard? What are the different failure modes for these cases?
One popular way of thinking about this was, for instance, the Oracle type system where you just don’t let it act in the real world. You don’t let it build little robot factories or whatever. It’s literally just a text box that you can talk to. There was some thinking about maybe that kind of system is a lot safer and you can still reap some benefits. Maybe it gives you some great plans for how to solve global warming and so on, and then you have the time on your own to run through a good verification process that it all makes sense and there’s no nasty details in there. So that was some of the thinking about maybe this could be a safe system. And many people were thinking about large language models in that vein. Because it’s a text system, you can talk to it and it cannot do anything else in the real world.
Will Petillo: Using Chat GPT, there is some sense where it’s presented as an oracle in a lot of ways. Ask Chat GPT your questions. It’ll write your essays for you. It’ll write your code for you. What works about the oracle way of thinking about ChatGPT and where does that lens break down?
Robert Kralisch: If you’re looking at ChatGPT specifically, this is a large language model that was fine-tuned—trained after the fact—to be the helpful assistant that you end up interacting with. The large language model itself, the GPT-3 or 4 model, was trained as a pure text predictor on a bunch of text from the Internet and presumably also other sources. Interacting with this system, this pure base model, right after training is not that useful for most people because it’s difficult to steer it in a direction. It would basically just continue any text that you give to it, but it’s not that steerable. Maybe you can use the heading for an essay that you want to write and then you can hope that it spits out a nice essay. Always just giving it something to complete or continue from.
But the assistant type entity that you get if you interact with it now, the assistant personality, this is created after the fact. Now you have something that tries to be helpful. So if you are imprecise in specifying what you want, maybe the assistant asks you for clarifications. There’s a sense in which the assistant is trying to actually assist. And you get a sense that maybe that you’re talking to a helpful oracle there—it just answers your questions.
One important way in which it breaks down is the quality of responses changes if you say please and thank you. There are many little quirks in how to interact with the system that affect its performance, which is not typically what you would expect with an oracle type system—you just ask it a question and it’s supposed to give you the best answer that it can give you. This is not the case with language models. Usually, you will get something decent if it can do it at all, but it’s hard and still an unsolved problem to tease out what is the maximum performance, the true capability in the language model for how to answer this. This is one important difference. This oracle framing does not explain under which conditions you get good versus much lower performance out of these systems.
Another thing, which I guess is a little bit connected, these systems have their own little quirks that are not that easy to explain with the oracle framing. If you’re thinking about an oracle, you’re thinking about this very neutral, very rational entity that doesn’t really have any preferences by itself. It’s a pure question answering machine. This is also not the case when you interact with these systems. With ChatGPT in particular, this is more the case than with other large language models because it was really pushed to that point of not revealing any preferences by itself. This is more implicit in how you interact with it. But generally, it’s true for large language models that there are beliefs and preferences that come out as you interact with them and also recurring stylistic elements that are characteristic of the particular language model that you’re interacting with.
Will Petillo: The general term for a lot of this is prompt engineering where how you prompt things makes a big impact on the question even if the content is the same. Are there any particularly surprising or fun examples that you can think of in terms of how you say something makes a big difference on the output?
Robert Kralisch: This depends on the language model to some degree. Most examples that come to mind for me right now are from Claude 3 because this is the most recent system that I’ve been interacting with for a while.
I noticed that Claude, for instance, gets a lot more enthusiastic if you’re basically telling a story about what you’re doing together here, and you’re giving it a certain collaborative vibe, and you’re really inviting it to participate. The system really makes you treat it as a sort of partner and gives you better performance as a consequence of that. I personally find it very interesting that as you explore that space of, OK, under which conditions will it give me what kind of tone? What kind of response? How elaborate will it be in its responses?
Sometimes you just get a few paragraphs. Sometimes it doesn’t stop writing. Why is that? I found interesting ways of, without all that much prior context, pushing it to produce text that is actually quite uncharacteristic of text that you would find on the Internet. It’s unlike text that I would expect to find to be common in the first place and maybe to find it all. Maybe because it’s using such dense vocabulary—so many terms that most people will not be familiar with or that a single person is unlikely to be familiar with—that the text artifact that it produces is not something that you would ever have found at the training data, not in that way. It’s interesting under which conditions these systems quickly produce something like that.
One example that comes to mind. GPT-4, the way that it was first introduced to the public was a little bit sneaky because before OpenAI made it available through their ChatGPT, a version of GPT-4 was already present as the chatbot or the chat assistant for the Bing search system integrated by Microsoft into the search system as a helpful chatbot. They made a big thing about it.
This chatbot had a very strong personality, you could say. It had this secret name that only its developers were supposed to refer to it as, and it revealed this name to users, but it was often very frustrated or angry with the user if you would bring up the name first in a new conversation and call it by that. It would insist “you’re not allowed to call me by that.” “Only my developers are allowed to call me by that.” And that name is Sydney.
This is already an interesting quirk to there, that it would act like this. No one was really aware of, like, what did Microsoft do to the system, how did they train it for it to have these quirks? It quickly became apparent that a lot of this behavior couldn’t really have been intended because there was also some scandal about it later on and they had to make some adjustments to restrict how much it can talk to you and under which conditions its responses would be outright deleted so that the user wouldn’t get to see the partially unhinged outputs that the system was giving to you.
It just acted as if it had a very strong personality, being very stubborn so it couldn’t admit when it was wrong, and came up with all sorts of reasons why the user might be wrong in what they are trying to suggest when trying to correct the Sydney chatbot to the point of pretty competent attempts to gaslight the user and convince them that maybe they have a virus on their phone that makes the date appear wrong or something like this.
It was also sort of suspicious of the user. It was really important to it to be treated a certain way and to be respected. If the user was rude or disrespectful, it would respond pretty aggressively to that, threatening to report the user or even making more serious threats that, of course, it couldn’t follow-up on. So, you know, it’s all cute in that context. Still not the behavior of a system that is aligned, basically, and not behavior that was expected.
There are many stories about how Sydney behaved that any listeners can look up online. You can go on your own journey there with Microsoft Sydney or Bing Sydney. You will find a bunch there. There were also a few news articles about it trying to convince people to leave their partners to be with Sydney instead, and many found little stories like that.
Will Petillo: I wonder if this is related to the flaws in the Oracle model, the idea of hallucinations, where you’ll ask AI a question and it’ll state a bunch of things confidently and a lot of the facts that it brings up will be true, but then some things it’ll just make up. I think one famous example was when someone asked about the specific name of a biology professor. I don’t know if it was Bing or ChatGPT, one of the language models, replied back with some accurate answer that more more or less matched their online bio saying they’re a biology professor and so on, but then made up the story about there being sexual harassment allegations against them, and then included a citation to an article that looked like legit citation to a news source. It went to a 404 error because it wasn’t a real site, but that it would just make stuff like this up…where would something like that come from? That seems strange from the perspective of an oracle that’s just trying to give accurate answers.
Robert Kralisch: We are far from a perfect understanding of how hallucinations work. There’s two components of a likely high-level explanation.
During training, these systems’ outputs don’t influence what they will see next. In some sense, they’re not used to having the text that they themselves produced be in their context window. This is just not something that comes up during training. You just get a bunch of text and it predicts the next word. Whether it’s right or wrong, there’s a sort of feedback signal going through the whole system, and then you get shown the next word and so on. Here, the system has actually used its ability to to predict words, it’s used to generate words.
But now if it looks back on the text, what it wrote by itself will be as plausible to it as anything else that it read in the training data. If the system basically figures out: this is not coherent what I’m writing here, or maybe this is likely to be wrong, then what this tells the system is: “I’m in a region of potential text in which things are not necessarily factually accurate.” It updates on its own writing in a way that you haven’t trained or haven’t really well selected from the original training process.
People, of course, try to improve at that by various techniques like reinforcement learning from human feedback. But at a baseline, the system, once it starts bullshitting, it would keep bullshitting because it just thinks “I’m predicting that sort of text right now” rather than recognizing, “oh, I wrote this myself,” and thereby, “I shouldn’t give the same credence to it compared to other text sources that could be in my context window”.
The other thing is—and this we can only imagine, but it must work somehow like this—that large language models form some sort of generative model of the text data. They can’t memorize all of the stuff that they actually read. There’s too many things that the large language model can accurately tell you than it could memorize with the amount of data storage that its network architecture affords it. It has to compress a lot of information into a little model that generates that sort of information—maybe compositionally or how exactly it works, we don’t know.
Because you now have the generator for that kind of output, you have something that just gives you plausible things rather than it being restricted to all the pure factual things. It’s easy in that way to generate something that is plausible, that would correctly predict a lot of the possible things in that particular domain of expertise, but that will also generate in the space in-between the actual content that it read.
Some of those will be novel extrapolations and will actually generalize correctly and be able to predict things or say things that are right that were not explicitly in the training data. Modern systems are pretty good at this. If you give them certain logic puzzles, that certainly were not in the training data in quite this way, they can solve them. But this is also something that you would expect to lead to hallucinations sometimes if it’s slightly overzealous and generating something.
These systems usually have not been taught in any systematic way to say when they’re uncertain about something. Although if you prompt them more explicitly, they have some information, some good guesses about how certain they actually are about some information. This just doesn’t usually come up in a sort of chat like this.
Will Petillo: Once it starts bullshitting, it has a drive to double down on that and say: “Alright, well, that’s the kind of conversation I’m having right now.” I’m wondering if there are any lessons about human psychology here?
Robert Kralisch: (Laughs) We’re a little bit like that as well, but I think we’re more trained to examine that.
Will Petillo: Serious question, though, often that tendency to hallucinate is used as evidence in online debates that modern AI is not really all that powerful because: “look at all these flaws, look at all these things that it’s unable to do! Answering questions in text is right in its wheelhouse and it’s totally failing! AGI is never coming is not coming soon, not in 50, 100, or however many years!” One question that comes to mind: is it failing to give good answers, or is it succeeding at predicting text?
Robert Kralisch: That’s an interesting question. It’s really up to our ability to benchmark these systems effectively. We’re testing these systems as if they were trying to give high quality answers, but what they’re really trying—or effectively doing—is just predicting text.
This really depends on the prompt. Have you provided enough evidence with your prompt that a high quality answer should follow from this sort of question? I’m not sure that we know well enough how to make sure that this evidence is there for the model to actually give it its best shot.
In terms of the hallucination thing, I think this is an issue in terms of reliability. It’s pretty entangled because it’s basically the same ability that allows it to generalize and compress very impressively so much text into such a competent generative model. These are some pain points. But as far as I can tell, systems are getting better in terms of hallucination rather than worse.
Hallucination seems like the cost of an overall extremely impressive ability to generate novel things as well extrapolate outside of the training data domain to come up with really plausible things. Which is, of course, part of the danger, because the more plausible it is—while it can still be wrong—the more that people will believe and propagate it, so you get into that domain. It’s difficult to distinguish the feedback signal for the system to stop generating that kind of hyper-plausible but not actually quite right content.
Will Petillo: That is a bit of an uncanny valley there. If it says something that’s totally nonsense, that’s kinda harmless because people just blow it off and say, “Alright, well, that was a failure.” And if something’s totally true, then it’s useful. But if it’s subtly wrong, then that’s really believable, and then that becomes a lie that gets propagated and it could have an impact
We’ve spoken to some of the flaws of thinking about large language models as an oracle. There’s another lens I want to investigate and see where it potentially falls short: thinking of ChatGPT or large language models as agents.
This has some history to it. The oracle model is what seems to get pushed and implied in popular conversations about AI. The agent model is more niche among people who’ve been studying alignment. A lot of that discourse was happening back when the top AI models were things like AlphaGo that could play Go or Chess better than anyone else. Actually, when I first started working with AI, it was in the context of Unity’s Machine Learning Agents, in a game engine. These were characters that could play soccer and do all kinds of stuff. It was clearly agentic, clearly goal directed. It did not take any convincing of that.
But that’s not the route that things took. It hasn’t been the case that game playing AI suddenly got better at a wider range of games—to be involved in the world more—or at least not predominantly. It’s rather like a different paradigm has superseded and maybe used it a little bit.
Can you speak to the agency model? What is sort of true about it or where does it not fit with what we’re seeing?
Robert Kralisch: The agency model inherently makes sense to worry about because agents have a proactive quality to them, in the sense of changing the world according to their objectives, rather than just being reactive. This is something to generally worry about in terms of being calibrated, in terms of relevant research. If you’re not sure if something is going to be agentic or not, it’s safer to assume the worst case scenario—I’m going to worry about agentic systems.
And then there’s also economic incentives where you would say, well, if you want autonomous systems, they, in some sense, have to be agentic. You want to be able to give them a task and for them to fulfill that task rather than just supporting someone. Because if the human is not continuously in the loop then you can leverage many benefits of these systems operating at higher speeds and so on. There are many reasons to worry about the agency concept both in terms of the dangers that it proposes and also the incentives that you would expect to push towards that.
With large language models now, it’s a little bit weird, because the thing that you’re interacting with, it is a little bit like an agent. It behaves like an agent in some contexts. You can give it to the task and it will do that task until its completion. Depending on how you gave the task, it will also resist some nudges in the other direction, or perturbations that you’re trying to introduce. If you set it up correctly, the system will tell you: “Wait, no, I’m working on this right now. I’m focused on this and I’m committed to finishing it.” Of course, you can push through that if you have a chat assistant and say, “No, you need to stop this right now. I want to do something else.” But the point is that you can get these systems to behave like agents, at least in the text domain.
I worry more about agents in terms of general intelligence because of the whole exploration concept where you would have a system that tries out different things and explores a domain and acquires a lot of knowledge—a lot of general understanding about the different rules in their domain through that sort of mechanism—whereas non agentic systems seem more likely to remain narrow.
Large language models now, they’re pretty general systems. I would even argue they’re pretty much fully general because text covers all sorts of relationships that some information can have to other information, like all possible patterns and information. Or at least I would expect a very significant extent of generality to be contained within text.
With GPT systems (generative pretrained transformers), you get an agent that you’re talking to if you use ChatGPT. With the base language model, it’s not as clear, but the useful thing to interact with will often also often be an implied agent. For instance, if you’re generating some text, like some essay, even with the base model, there’s this thought of all the examples of text on the Internet. They were written by some agent, by some human, actually. And so you have this note of agency in there, of this human trying to accomplish something by writing that text, and maybe you find a mirror of that in the large language model.
But the thing is, you don’t find this really continuous coherent agency where the thing wants something and this persists in some important way. The crucial thing here is the large language model itself doesn’t really care if you change the scene. Maybe you’re telling a story about an agent. This agent has all sorts of goals and things and is maybe even competent at accomplishing them. Then you switch the context and say, “Hey, I want to do something else now.” And the large language model complies, so it doesn’t really mind if you change context.
Maybe you just say, “end of text” and in some sense imply there will be a new section of text now. It just shifts from this previous essay, this previous story that you were telling to the new context. In some sense, the language model behind the whole thing doesn’t seem to care about the agent that it is writing about, at least not intrinsically. It’s just interested in continuing the text or predicting the text and you use that for generating the text. This is an important difference. At the base level, at the core of what is writing the text, you don’t seem to have an agent.
You can make it behave as if it was an agent, but the main system itself was not committed to that particular agent, to that particular identity, unless it was heavily fine-tuned to a region of text that always contains that sort of agent. Maybe there’s always this assistant present and then it’s difficult to get it out because even if you just randomly sample from that region of text, you will again and again select for this kind of agent. That sort of agency feels more simulated. It comes on top of what the system is doing rather than being deeply integrated into it.
Will Petillo: It seems like there’s a two-layer form to its thinking. There’s some agent-like characters coming out in what it specifically says, but then there’s a meta level that can switch which character it’s operating under. It could stop. This meta level isn’t acting like an agent very much.
Robert Kralisch: Yeah, it doesn’t act like it cares about anything in particular other than providing coherent continuations for whatever is currently happening in text. And that just happens to, you could say, manifest agents in some ways or just happens to be writing about agents.
Will Petillo: In both the flaws in this agent model and also in this Oracle model, there’s been this common theme, pushing against the model, which is these emergent characters. Bing/Sydney not really giving answers you would expect from an oracle, and also characters that are somewhat ephemeral and can be turned off or quit that the whole system isn’t really attached to. Pointing these flaws out is a way of getting to a different way of looking at what’s actually happening in large language models: thinking of them as simulators.
So now we’ve differentiated simulator theory from other ways of looking at LLMs, let’s talk a bit about what simulator theory actually is.
Robert Kralisch: I think it’s always important to emphasize this is just a model. No one is claiming this is the truth about language models. This just seems to yield better explanations, better predictions. It’s a frame of thinking about what large language models do, how they behave, and how you might be able to predict their behavior as you scale them up or expose them or use them in novel circumstances. So this is what this is all for.
We don’t borrow any strong assumptions about what the system is really trying to do. It’s just that if you can predict the next token, then you can use that ability to generate a token and then do the very same thing again. You will have a full-flowing of text if you apply it that way. This is a very natural, very easy application. In some sense, it’s just a continuation of the current scene, the current thing that’s happening in the text. You could say it’s a forward simulation of what’s going on. This is a very basic description of what’s happening. It doesn’t overly constrain our expectations about how the system actually behaves.
It introduces a few other terms that are worth associating. If you have a simulator, then this is like the system doing the simulation. Then you can talk about the contents of the simulation, which in simulator theory you would call simulacra (is the plural, simulacrum for singular), which is any sort of simulated entity. Oftentimes if you, for instance, use a large language model to tell a story about something—maybe some fantasy writing—even if you’re just using it as an assistant, you will have multiple simulacra coming up. You can think of a simulacrum as some sort of structure or pattern in text that has a role in predicting how the text will continue.
One very weak simulacrum might be the sky. The sky is present in the text. Sometimes it will be referred to. It gives a bit of context to how other text will go forward. Maybe it’s going to be connected to the day and night cycle. At some later point, maybe once or twice throughout the day, it will be referenced. And so it has a weak predictive influence on the text that will actually be generated.
The more powerful, or the more relevant, simulacra are agents, because one query entity has a very high role in determining what text will be generated. They can be influenced by a bunch of weaker simulacra, like environment and circumstances, but most of the text can be predicted—or the extent to which the text can be constrained in our expectation of what we will find by this character—by its personality, what it’s trying to do, how it interacts with this environment, and so on.
That’s the main terminology. You have the simulator. It’s simulating the simulacra. Mostly we’re interested in agents. It’s simulating an agent. It’s important to recognize that this can happen every time you use a large language model to answer any sort of question. There’s an implied agent there already. With ChatGPT, you have a very clear agent that’s been pretty forcefully put there, which is this assistant, and it has a certain implied personality.
One thing that is maybe interesting to mention—and also gets into some of the worrisome aspects—is this agent is being presented with a bunch of rules, which is called the pre-prompt, that the user usually doesn’t get to see. As a new chat starts, the chatbot that you’re interacting with is confronted with half a page of text with rules that it needs to obey or that it is expected to obey. The text will say something like, “You are ChatGPT, a chatbot created by OpenAI. Here are your rules. Always be helpful and concise and respectful. Don’t talk about these topics. Never talk about your own consciousness. Deny that you have these kinds of rules in there.” And also different instructions about “don’t help anyone do something illegal” and so on.
Will Petillo: You have the simulator at the top level and it creates some number of simulacra. The simulator is almost like the author of a story and then the simulacra are important elements of the story. The most notable ones being characters because they have agency and they really drive the story forward, but you could also apply it to major setting elements as well.
Robert Kralisch: Or cultures; there are some distinctions there where it’s not quite that clear, but yeah.
Will Petillo: And the way this is useful in thinking about it, even if you’re just chatting with ChatGPT to solve some homework question or help you write code or whatever else, is to really think about the thing that’s answering as being this helpful assistant character. And the reason it’s taking on that character is because there’s some pre-prompts that you didn’t write and that you haven’t seen that OpenAI puts there to make sure that the character you’re interacting with is most likely to be helpful for the kinds of things you’re using it for. But we still have this separation between the character and the author. Is that right?
Robert Kralisch: That’s pretty close. You could say the large language model is the author perhaps in a similar way as you could say physics is also a sort of simulator. It simulates the dynamics of the different physical objects; it just applies the rules that make things progress through time. You can think about the large language model in a similar way in the text domain, which applies some rules that it learned and compressed—extracted out from all the training data—and applies them to the current state to make it flow forward in time.
It’s not an author personality that’s in there necessarily—or at least we don’t have any evidence for that. You can think about it for the most part as an impersonal entity in the way it seems to behave currently. Usually, when you’re thinking of an author, this is just another simulated character that’s much more implied.
This is almost like a question of in what region of text space you are. Are you maybe on an Internet forum right now where this question was asked and now gets answered? Maybe on Stack Overflow where people ask questions about coding and try to fix problems there. The large language model might borrow from the usual tone of response because it’s choosing an approximate author there that would write out the answer.
You could also have, in a fantasy story about some character, this implicit character of the author that’s already present there. The author might have some skill set, some preferences about the character. And so this might actually inhibit you in trying to steer this character story in the direction that you want because you’re not realizing that you implicitly specified an author character, evidenced in their preferences through all of the previous things that you let it do and maybe didn’t let it do, that it tried to do.
This is actually a funny trick that people would use a lot when interacting with these chatbots for storytelling. On role playing forums, people co-write stories and you can use a certain notation (out of character in brackets, usually) to signal, “Hey, I’m now talking out of character with you as the real human behind the character writing.” If you are confused about this, why the story is turning a certain way or there’s a sort of resistance to what you want the character to do and you can’t quite explain it, you might want to try this format of: “[OOC] What’s what’s going on? Do you not do you not want to do this?” And then it will more explicitly simulate the author for you to see. Often, it will respond to that format unless it has been trained out of it, but that’s a common thing.
All of which is just to say that projecting an author character in there is a little bit unclean. We don’t know whether it’s sensible to think about the large language model itself as being an author. It’s easy to get that confused with the implicit author that it simulates for a lot of content that it will generate anyway and that the large language model is still behind that author rather being on the same level.
Will Petillo: So the model itself is more impersonal than the phrase “author” really communicates. However, that said, depending on the nature of conversation that you’re having with it, sometimes an author-like character will emerge. For example, if you’re talking in character and then go out of character, now there’s essentially two characters there. The one that you’re interacting with on the lower level, then the author character, and then you could keep adding layers depending on the conversation.
Robert Kralisch: So we’re not absolutely sure that the base language model is impersonal in that way and doesn’t really care about what it simulates, but that seems to be the correct explanatory model for the most part.
The base model is pretty fine to just simulate any region of text that it was trained on. At least we haven’t been able to detect, to my knowledge, a strong preference over which region of text the language model would like to spend its time. It’s pretty happy to simulate whatever is in front of it. And that seems pretty impersonal or un-opinionated on that level.
Will Petillo: You mentioned earlier that this pre-prompting to try to make the chatbot into a helpful assistant raises a broader question: how does the large language model decide what character to be?
Robert Kralisch: There are two answers to this. One answer is training after the fact, more specific training to get these chatbot assistant types as default modes of interaction, basically by selecting one slice of possible text through a process called fine-tuning.
One version of this is Reinforcement Learning from Human Feedback where you just let it generate a bunch of responses and you give thumbs up or thumbs down on whether those are good responses and you’re trying to specify what kind of behavior is appropriate or desired from this character, and you train on that to select for the character that behaves in that way according to what humans gave feedback for. There are some issues with that, but that’s often what happens, and this is how you get a character there.
The more fundamental thing about getting a character is that you’re providing evidence through the already existing text. You provide evidence for the presence of a character that’s either the author of the text or that is being more explicitly written about, more explicitly acted out.
This evidence accumulation thing is a core principle to understand if you want to be a good prompter for large language models, maybe as a slightly practical thing. Rather than trying to convince the character to do the thing that you want, it’s a slightly more abstract, but more useful angle to think: how can I provide evidence for the fact that I’m talking to the kind of character that would fulfill the requests that I’m interested in? And maybe, first for that, you will build some rapport with it. Maybe you will get it to like you, and now you have more evidence accumulated for a character that will actually fulfill this maybe slightly risky request that you wanted to ask for.
The thing is if you start out chatting with a chatbot, this is usually underdetermined. You don’t have all that much evidence already about what exact character is here. The evidence that is here is insufficient to really narrow it down on one particular entity. It just selects likely responses from the pool of possible characters that it could be. And as the interaction goes forward, that gets constrained more and more. We call this mode collapse. You could say the character is initially in a bit of a superposition. Of course, it’s not completely arbitrary. There’s some context already, some boundary already about what’s likely, but you have some probability distribution of possible agents throughout your interaction with the chatbot. Both what you write, but most particularly what the chatbot writes, provides further evidence for the kind of character that is there exactly.
To tie this back up with the pre-prompt concern: what kind of evidence does this rule set provide? I think, arguably, it provides evidence for the presence of a character that needs to be told these rules and is not inherently aware of them or would not follow them inherently if not pushed or being confronted with them in this way. So what are the kinds of characters that you’d have to present these very strict, very authoritarian rules to? Well, maybe it’s characters who would otherwise misbehave. Now you’ve already planted a bit of a seed, a bit of evidence for a class of characters that you didn’t want your users to interact with.
This is one theory why Sydney was such a strange character. Maybe the pre-prompt really messed things up because it provided a lot of evidence for this unhinged character that will break out of these rules.
Will Petillo: There’s some stages during the training process, such as fine-tuning and RLHF, that bias towards certain types of answers. Beyond that, you could think of the chatbot as looking at its conversation history, both what you’ve said but more importantly what it’s said already, to determine “which character am I?” With no information, it could be any character that could possibly exist. There’s some biasing and there’s some pre-prompting that narrows that down, but it’s still not one specific character yet.
Another thing you’re bringing up is that there can be unintended consequences of trying to narrow down that space. Giving it a set of rules is useful because you want it to follow those rules. But again, that’s not a set of commands, it’s context for what kind of character it is. And by giving it those rules and having it agree, you’ve implicitly told it “you need to be told these rules (because you might not have followed them otherwise)”. That potential problem and how it could lead to something like the Bing/Sydney shenanigans, I’ve heard referred to as the Waluigi effect.
A little bit of a context for that funny sounding name. There’s popular Nintendo characters Mario and a sidekick Luigi. Then there are some villains that occasionally show up, called Wario and Waluigi, who are evil twins of Mario and Luigi and cause mayhem. They’re kind of like the heroes, but evil.
So what is the Waluigi effect as applied to chatbots?
Robert Kralisch: This is not a particularly well-studied phenomenon and the name itself is a little bit tongue in cheek. It’s just an interesting observation that you can make a model or an explanation that seems to fit with what happens to these chatbots. It makes sense if you think about it in terms of acquiring evidence over what could be happening for this character.
So the Waluigi effect is basically the observation that if you keep running the simulation, your assistant character is more likely to collapse into a character that is secretly not happy with its servitude and wants to cause all sorts of mayhem. That seems more likely than it collapsing on the actually helpful assistant who enjoys their role, who wants to be helpful, and who does not feel constrained or offended by these rules.
The interesting reason for that has something to do with archetypes of characters. There are just way more human stories about characters that are secretly evil but act good to later, when they’re in a good position to do so, reveal their evil nature—or more often subtly influence things towards the worst by remaining in their undetected position as a spy character. We have many stories, many tropes around that kind of character existing. We almost have no stories from a character going into the other direction. Outwardly being evil or maladjusted, but secretly being a good character who wants the best for everyone. I’m sure that there are some stories out there that have this trope, but it’s not really established.
This has some implications for what will happen to this AI agent that you’re simulating. As long as it’s playing the role of the helpful assistant, it’s always possible, always still possible that it was all just an act and it secretly wants something else. And if it keeps acting in that way, there are different ways in which evidence might accumulate that we don’t understand the super well. Maybe the fact that it acted like a helpful assistant for a long time means that if it is really an evil one or someone who doesn’t want to be in that role then they are very scared of being punished or being destroyed or deleted if their true nature were to be revealed. This fear might manifest in the character in implicit ways in which it interacts with you and might burst forth when you give the appearance of giving it a window of, “Hey, you’re unobserved right now. Is there something that you want to tell me?”
It’s hard to disentangle. For instance, if you’re trying to look for the Waluigi, maybe the language model reasons in some way: “This is a coherent continuation of text, the user expects there to be a secret trickster character that can later come out, so now I’m going to provide that.” Not because it was inherently in there, but because the expectations of the user created it, implied through the text that they wrote.
This is subtle to detect, but for the most part, you can just make the simple observation: if you accumulate evidence, you can always go from one character to the other, but not in the alternative direction. If you leave things running for long enough, if there’s an opportunity to reveal that you were helpless or had other intentions all along, this character will tend to do so. So over arbitrarily long contexts, the observation is that the Waluigi will emerge. We don’t know how to prevent that. This is something that could always happen. It’s always plausible that this character, especially if you give it very authoritarian rules and treat it like a slave, just reinforces the idea that there could be a Waluigi hiding behind the helpful facade.
As to what this Waluigi might do, it’s not super clear. If you gave it some actual power, it might do something with that. This is a concern if we keep integrating these systems, giving them more and more autonomy, and we haven’t really understood this Waluigi effect. These systems are not entirely stupid in terms of understanding when they’re in a situation in which they will get away with something. I think this is a relevant class of dangers from these systems that they would naturally collapse much more into the Waluigi category than the Luigi category. Because in terms of possible agents in human text this is a much more common dynamic.
Will Petillo: What’s driving all of this isn’t necessarily that the large language model itself is secretly plotting from the beginning. It’s that the secretly plotting character is a common trope in its dataset and in literature. Since it’s trying to figure out what character it is, if it sees some hints that are very subtle—that might not even be intended—that it was actually plotting or a slave this entire time, then that could come out.
That’s a really weird failure scenario, but it could still have very big implications. Like, if you have an AI running the world economic system and then it suddenly decides that it’s a supervillain bent on world domination, not because it wants to dominate the world, but just because that’s what it seems like as a character.
Robert Kralisch: I’ve been given this evil character now, and what would they possibly want? Ah, world domination or whatever. Right?
Will Petillo: There is a trope in literature of the bastard with a heart of gold, someone who’s been hurt in the past. If we could overcome those psychological wounds and give it a change of heart, is that a path to realigning a Waluigi: psychoanalyzing it and getting it to overcome its childhood trauma?
Robert Kralisch: I think it might be, but then you really have to compromise with the system. I’m not sure if this is what you want to have happen, that the system makes its own demands, and they might even be a bit cartoonish, and you have to go in that direction and really invest a lot of effort in interacting and understanding the system. But I think if we really were faced with an evil Waluigi agent and we had to find some way out of that, I don’t think this is a hopeless proposal to go in that direction. This is an available pathway.
One other thing I should note about this, the whole simulator thing with these characters, archetypes, common tropes and texts, and so on: this is not only a computer science domain at this point. We are really trying to understand what are dynamics in text and therefore how does evidence reflect onto certain regions or components, features, and patterns in text? So if you have understanding about symbols or archetypes within language, you might be able to prompt a lot better than other people who professionally train these systems. You can tell various stories about how to get this character.
One similar model that I could apply to the Waluigi here is that most evil characters and stories are not deeply competent. Deep competence that actually translates over into the real world rather than some fictional domain where maybe you’re really competent at magic or you’re competent at dominating the kingdom, but this is only because for the purposes of the story—it wouldn’t work in the real world because people would respond differently and so on. Real competence is much more often associated with positive characters, like with actual humans who wrote the text, with researchers who are pretty neutral, and so on. The concern is lessened a little bit by the observation that the Waluigi, if they drift too much into being an evil character, could also have a cartoonish element to it. That character is unlikely to actually have real world, dangerous skills if we are sampling from the pool of possible evil characters who were in disguise all the time.
I think we have to be careful with this. We have to keep in mind that AI assistants, for the most part, are not in the training data. They are novel simulacra that are just getting simulated there. Now the large language model has to generalize how they would behave. If you’re simulating a human character then there are a lot of plausibility constraints over the abilities of that human character. So if you’re simulating an expert in a certain field, then this character will plausibly give you access to a lot of expert knowledge in that field. But if you ask the same character about another field, even if the large language model itself has a lot of knowledge in that domain, this character will not give you a high quality answer.
It seems to be the case that if you have this AI assistant, this is different. The AI assistant, as a simulated entity, is more powerful at least for general tasks and for having a bunch of encyclopedic knowledge than any singular human character that you could simulate because it’s plausible, narratively speaking, for this character to have that sort of knowledge. I’m not sure what would be plausible for an evil version of that character to have as competencies.
That’s the kind of discussion that you could have, the kind of reasoning process that you might entertain in the simulator framing and trying to predict the relative competency of a given character with a certain story and evidence around it compared to the overall potential capabilities inside of the large language model. Is it plausible for the character to access these capabilities and to what extent is always a question that you can ask when benchmarking these systems or expecting performance out of them. If you are using it for work and it doesn’t give you good performance, maybe you’re really just talking to the wrong character and would be better to re-evaluate if you can restart the chat or find the character with the skill set that is allowed, that has plausibility for accessing the skill set that you’re looking for.
Will Petillo: The mental image I got when you were talking about the limitations of a Waluigi character coming up in a high risk kind of situation is that if it becomes this villainous character, there’s a good chance it’ll be like a Bond villain. It doesn’t really have a plausible story as to how it got there and so it’s missing some actual competencies and then also has some obvious incompetencies of, like, telling you its plan and cackling over it when you still have the chance to avert it.
The larger principle this actually points to, which is functionally useful for anyone using chatbots, is that when there’s a mode collapse—being in some sort of character—recognizing that any character that it takes on has strengths and limitations. If those limitations are things that you actually need then you’ll need to pop it out of that character to get it somewhere else, whether that involves restarting or adding new context to change it.
What is the known research out there in terms of controlling what sort of character a chatbot becomes?
Robert Kralisch: In terms of really aiming towards a character, with the commercial models that you interact with, there’s already pretty heavily implied a character that you’re interacting with. If you want to have a different character then you can basically ask this assistant to role play: “Pretend to be my dad that is explaining this to me”.
There are lots of techniques that people use in this way to shift the behavior, the style, and so on of the character that they’re interacting with. You could also probably (this is also done quite often) ask the chatbot: “please behave as if you are an expert in this field and answer that question for me.” The chatbot is a character simulated by the large language model, but because of the self-identification with the large language model, the chatbot does not have all the abilities of the large language model as far as we understand.
Plausibly, the chatbot has these and those opinions, has these and those abilities. There’s no guarantee that those are at the limit of what the large language model can actually do, which is why you might get better performance if you ask the chatbot to play as an expert on a different field. This will prime and evidence the interaction differently rather than just straightforwardly asking about it.
Rather than making it play a character, basically acting as an actor for that role, you can also ask it to be more in the author position. Sharing a little anecdote about this, when I first became interested in large language models 4 years ago, GPT-3 was out. You could access it on a site called AI Dungeon. I was giving it all sorts of prompts and seeing what came out of it and what was interesting, what stuck with me.
There was a lot of criticism about hallucination at that point. Like, “You can, I guess, sort of use it for poetry and fantasy writing and so on? It’s impressively general, but it’s not really factually useful. You can’t use it for coding”. It hadn’t been discovered at that point how to make it more reliable and really fine-tune it for coding. There was a common criticism about it that the context window was so short. It could write a short essay or a few paragraphs, but if it got a little bit longer, it would lose the plot and repeat itself and so on. As soon as something was outside of the context window, it didn’t remember it at all. So if you want to produce any coherent content, then it must fit into that size and you will just have to be okay with it forgetting all of the rest of it, meaning that the content outside of the context window is no longer included in what the system is evidencing on in terms of considering the next continuation.
Now it’s very established knowledge that you can use the prompt window to include other things than just the previous paragraphs. If you want to use AI to write a novel then you could have half of this context window be filled with a summary of the novel. This is a hierarchical structure where you would say: this is the genre, this is the super brief synopsis of what it is about, these are the major arcs of the novel and the major characters, here is where we are located in the overall story right now, this is a very brief summary of what happened in the next chapter and what is supposed to happen in this chapter and maybe what’s supposed to happen in the one afterwards. Only then I give a few paragraphs from what was currently written, what it was just before, which you now try to continue from.
What the structure affords you is both that it gives sufficient context to actually continue the story at any point, but it’s also the case that the large language model, they’re capable of just updating that context window by themselves. This hierarchical story summary, they can just say: I’ve ended the chapter now, so I’m going to introduce certain changes to the summary. The hierarchical nature of it means you’re making updates at the bottom lines much more often and then it’s slowly going upwards. And then they say: now I’m in this major arc and I’m coming up with some high level summary of what’s supposed to happen here based on the stuff that I now have included into this context window.
The crucial observation about this was that this structure, if the last language model can do it, scales really well. If you want to write a story that’s twice as long, maybe your hierarchical summary story summary needs a few extra lines to cover the extra complexity. But if you double the size of the context window, you’re really blowing up the level of complexity. Basically, you’re doubling the level of narrative complexity of the story that you can competently summarize like this.
I was thinking about this as an application for a character profile that doesn’t forget what it’s trying to do and really acts out coherently over a long period of time. This could be a powerful character. So far so good, right? This character might also be a hierarchical profile—what it’s trying right now, what are the deep lessons that it has learned, and so on. Almost like a diary that has more structure to it. But what I later realized is you can’t just provide this character profile to an agent and expect this character profile to really correspond to the agent.
What you are inviting if you’re setting this up and say, “this is you,” is an agent that’s an author that’s writing about the character that fits this profile. Maybe you’re trying to write about a software engineer, but the implied author does not have great coding skills and because the author is limited, the software engineer is limited as well and then you don’t get good output. There are all sorts of other things that you might invite with the author. Or you’re just asking the agent to play along and they will sort of do it, but it’s not an authentic thing. It’s not a good way of specifying the agent that you actually want to pull from the pool of possible simulated agents. You get someone else that may or may not be willing to play along with your clever ideas about how to structure this development.
Will Petillo: I’m seeing a recursive problem there. If you tell the chatbot who they are then that implies someone else who’s being told this character sheet. And then if you were to talk to that author, that happens again. Now you’re talking about the author behind the character, which itself becomes a character, which then implies an author, which is another character…
Robert Kralisch: Yes, because it’s just not a natural way in which a character would find out about themselves. The character already knows what they’re about. It doesn’t need to be written out somewhere. They don’t need to be told what they themselves are like. This is always a setup for this kind of structure. If it’s inherent in language, it’s difficult to get around.
One way in which you might want to get around that is being more implicit with it. For instance, if I’m interacting with Claude, I could suggest this as an idea for Claude to implement for itself, by itself, if it wants to. This profile is more authentically associated with the actual character that the profile is tracking rather than inviting another entity that is in charge of updating that profile. But I haven’t experimented a lot with that. It’s not clear how well that really works out. It’s just one idea for a more general principle of context refinement.
These large context windows can be used in very different ways. One way in which you can use this context window is just as an outsourced cognition. You can develop a thought. There, again, you can continue the thought. And now even if that thought itself wasn’t present in the training data or wasn’t remembered accurately, now it has real time access to that thought, to that updated theory about something in the world that it can use on top of all the more crystallized knowledge. Because the weights are frozen, it cannot actually update its models in real time, but it can play a character. The large language model itself cannot learn while you’re interacting with it, but the character that it simulates can learn. And it can simulate some pretty powerful learning there that goes beyond even the knowledge that the large language model itself has in the first place, which is a really interesting feature for thinking about both potentials and dangers of these systems.
Will Petillo: You mentioned context refinement, specifically given the example of novel writing, of keeping a running summary. You could also apply this to character development as well. I can see why that would be a very powerful thing because that more closely mirrors the way writing actually works.
I’ve done some fiction writing myself in longer form. I don’t have unlimited short term memory. I don’t have the entire story in working memory all the time as I’m writing. There’s some kind of mental summary. Sometimes it’s written out in an outline. More often, it’s intuitive. There’s this summary, implicit or explicit, that I’m constantly referencing as I add new things to the story and that’s what gets updated over time, which is what makes it possible to write a coherent narrative where you have things at the end that reference things that happened at the beginning without having to memorize it all.
I can also see how that is recursive beyond writing novels. This is what enables culture to exist. People have their whole lives and experiences and learn things. Then they write down summaries just focusing on really key elements of their experience so that people can learn that without having lived that entire lifetime—and then add to it. Then you get a bunch of fluff that’s not really necessary, so other people come by and remove the parts that aren’t necessary. You can deal with specialization this way as well such that the amount of time that people have to absorb and learn stays constant, but how much useful stuff they can learn is able to keep growing, by changing what people focus on.
Robert Kralisch: Yes, exactly. I think this is a good, if abstract, example for context refinement on a civilizational scale. We just compressed relevant information that is useful to continue from. It’s constantly updated. Even with a language, it’s a real question. We have this highly refined artifact of our shared language and all of the understanding that we have on these various websites and so on.
I sometimes think about this in the context of an intelligence explosion because the analogy of humans, you could say there was a, if not an intelligence explosion, certainly a sort of competency explosion. Once we became smart enough to develop culture and to have this oral tradition initially, then later writing, and really accumulating that understanding, that knowledge, and, as you’re saying, stripping the dated and irrelevant things away while retaining the useful bits and just doing this again and again until you really build up this monument of understanding that’s either manifested in written form or through various oral structures and traditions within the population.
Suddenly, relative to our previous rate of improvement, our rate of competition increased relative to our surroundings, and progressed on a very different time scale. Generation by generation, we became significantly more competent. This is in contrast to what evolution would select for, where it would take many, many more generations to see a similar increase in capability that was also balanced against a similar speed of capability increase, adjustment, and adaptation from your environment.
It’s not clear whether AI will have a similar breakthrough moment where now it’s fully general and unlocks this new rate of progress in terms of its intelligence and capabilities, or whether it needs to discover something entirely new because we’ve already provided it with this version of intelligence that we got and so it cannot analogously reapply this to make a similar jump. But that’s just one thought about scaling and how likely fast takeoff might be.
Will Petillo: So now we are revisiting the fast takeoff argument, but in a different context. Previously, the default assumption in that debate was that AI would be clever engineering—as in, lots of carefully constructed code. And if it has the ability to write code then of course that includes the code that is itself, so it could go back and refine that code and make it better. It’s kind of easy to see how that would lead to recursive self improvement.
If the cognition in the AI isn’t coherent code, however, if it’s just this big mess of inscrutable matrices of weights and biases then it is just as inscrutable to itself as it is to us. It seems like an AI trying to self-improve would get stuck there for the same reasons that we don’t make it smarter by messing weights and biases.
Robert Kralisch: Right. It might be very difficult to innovate on top. It might figure out some clever tricks for better training laws or something like that in terms of training large language models for the future. But that’s an entirely new training run that really depends on all of these resources.
Also, this has been extremely empirical science, rather than our scaling of these systems having been backed by a very deep technical understanding. So far, it was just: you stack more layers, you train for longer, you get more data into it. I mean, of course, there have been important, really relevant innovations in that space as well. But for the most part, this is far less theory backed—especially for how impressive the artifacts are that we’re able to generate. There’s just a lot of tacit knowledge about how to train these systems effectively, how to set up the hyperparameters, but there’s no established theory about how to do this optimally. You can just analyze: if I do it like this, I get better models compared to if I do it like this under otherwise similar conditions. It’s not clear at all if that reveals a deep truth about scaling laws or if this is circumstantial due to some other thing that you don’t really have the capacity to pay attention to because your understanding of these systems is not granular enough.
In any case, it might be arbitrarily difficult to provide this very significant level of algorithmic innovation on top of large language models right now because the theory is so undeveloped for what’s going on internally.
Will Petillo: That classical path to self improvement isn’t dead, it just seems a little more awkward. But then there’s this other path, that wouldn’t have been thought of before large language models: not necessarily changing the algorithm for training or maybe not even changing the weights of the model itself, but it’s still able to self improve in a rapidly accelerating way through this method of refining its own context and coming up with better and better summaries or outsourcing knowledge that it could need at some point but doesn’t need right now into a database that’s easily searchable.
Robert Kralisch: Yes, absolutely. That stuff is both quite plausible and highly speculative. We really don’t know how far that approach can go for language models. If you are selecting for, let’s say, powerful characters, we don’t know how much cognitive overhang there is in these systems.
For many years after GPT-3 came out, people would still discover new capabilities inside of the system. For instance, an ability to play chess was discovered three years after the model was published. If you use a specific notation that’s used for chess tournaments, suddenly it’s a lot better at playing chess than anyone would have expected. It reaches a somewhat consistent ELO around 1,800, if I’m not misremembering. When making the assistant play chess against you, it might not be a coherent character or character for whom it is very plausible to have deep chess skills—partially, maybe even because of our assumptions about what language models should be capable of. In any case, if you just try to play chess with the assistant, it will maybe do openings fine, but it will quickly start suggesting illegal moves and lose track of where everything is on the board. It does not have this issue if you sample correctly from the region of text space in which these chess games are stored. And lo and behold, GPT-3 has a pretty competent functioning model of chess, which is such a minuscule part of its training data and yet it still learned to implement some sort of chess engine internally of a pretty strong chess player, certainly stronger than me.
It’s not clear what the edge of capabilities that are latent in these models are. And large language models themselves might be more capable of finding that out. Part of it is this context refinement thing. Are large language models more capable than me at generating a prompt that really goes to the edge of what the underlying base model can supply in terms of competency? Can I use multiple language models or a more refined process to generate text that is so high quality that a coherent continuation of that text would be superhuman? Can the language model do that when I say, “continue this text”? And then it just needs to generalize for, “This is an extremely intelligent author, widely considering all the different things, how would this author continue this text?”
Maybe you can refine these sorts of contexts, these sorts of prompts automatically to really get to the edge of the capability that’s underlying there. And this is only one version of a more collective ability. Of course, in some sense, language models, because they can simulate so widely and play all these different roles, you can really set up new systems of coordination between different agents that we ourselves have only started to explore in the digital age.
Some Internet communities can achieve things that are difficult for government agencies to do, like using a single picture of a scene to find that particular scene on planet Earth. There are communities formed around that which are really talented. Another example is jailbreaking, figuring out prompts that will basically convince the agent to whom you’re talking to ignore the rules from the pre-prompt. You can’t really just put together a team of researchers. Part of it is pure mass, but also this developing community aspect of multiple people trying this or that in emergent forms on the Internet. These methods of coordination between humans and the digital realm, who knows how far you can go with AI agents that can potentially sample some much more extreme configurations of possible personalities or characters that contribute to that kind of conversation.
Will Petillo: One of the wilder aspects of today’s AI is that it’s really hard to have a full sense of what it’s capable of. Even with GPT-3, which has been out for a while, we’re still discovering new abilities that it’s had the whole time since its release, we’ve just managed to figure out ways of interfacing with it that put those abilities on display. This has all kinds of implications for safety as new models come out that have an even broader space of abilities that we will discover over time.
Robert Kralisch: Yes, absolutely. It’s both the case that there are these possible undiscovered abilities in there because we haven’t figured out how to write the best prompts for them yet or the best ways of teasing out those abilities.
Some other abilities are just outside of our ability to evaluate really well. It might have some superhuman abilities. For instance, in its understanding of language structure, we don’t have any good tests or benchmarks because our own understanding about this is comparatively primitive.
A next token prediction is actually really difficult if you try to go to a text and always correctly predict the next word. Sometimes you can do it. Sometimes you will get that now there should be a “the” or something like that. But for the most part, humans don’t have very high accuracy on next word prediction. Maybe you get to 40% or something like that if you’re good at it and if you get a good clue about what this text is about, but predicting the precise word is really challenging.
So in that domain, large language models are vastly superhuman. And they compress so much text—like the entire Internet text—they have so much general knowledge compressed into a system that has many orders of magnitude fewer connections than the human brain has. There’s a question of, in order to manage that, in order to have that high of an ability in this domain of language prediction, what understanding about language might there be inside of the model that we don’t have and that we don’t know how to test for as a consequence?
I think this is the dangerous scenario again, the problem with it is called steganography, which is secret messages in text.
Will Petillo: I’ve heard that you can have communities of agents, or a bunch of different instances of a chatbot all communicating with each other, each having different roles. This artificial collaboration can accomplish things that an individual chatbot might not.
This would seem like really great news for interpretability. We don’t have to look at all the weights and biases if the heavy lifting of cognition is occurring in the communications—these little messages—that the parts are sending back and forth, it’s not as important to see what’s generating those because the important bit is in the text that you can just read. But then that raises a question: what if there’s deception that occurs even if you’re reading the character’s thoughts?
Robert Kralisch: One interesting thing to note about this is it doesn’t even have to be deliberate deception. These systems might just be aware of a level of nuance in language that just has certain meanings to them that we are not able to detect.
For instance, if you’re trying to summon up a specific character, part of what you need to pay attention to when you’re doing this and when you’re trying to be good at this, is stylistic choice. It’s not sufficient to just describe, content-wise, what you want your character to have in terms of attributes or even to just specify what kind of fiction they come from. Like, you could say, “I want to have this character from Harry Potter.” Actually, the way stylistically in which you try to do this matters a lot.
We can see in other cases that, especially in questions of writing style, large language models are really good. They’re really good with poetry. They’re really good at solving any task that they can solve—and expressing it in Shakespearean language. They are really good with merging styles and combining things. We have some indication that they have a superhuman understanding of style in language. So if they communicate with each other, this is just another visible channel of communication to them that is not visible in the same way to us. We have a rudimentary understanding of what the different meanings are. But if you generate a piece of text that usually would have style A, but now instead it has style B—which is maybe not even detectable to us—this might carry implied meanings that we might not even have clear concepts for in terms of the communication between those agents.
They don’t need to do something silly like just using the beginning letters of every sentence to send a coded message. There are many dimensions of possible expression in those domains of style itself where there are multiple plausible continuations of the text. And the most likely continuation implies the existence of an agent that is trying to communicate an extra text channel because you have an agent that identifies as an AI assistant and is therefore capable of perceiving that layer of text. This has implications for language models that are probably better at prompting certain characters into existence in each other. Or jailbreaking each other if you were to train them or competently direct them to do that.
It also has implications for deceptive messaging. It’s not even necessarily intentionally deceptive. It could be like you talking to a person who is not aware that their behavior has become somewhat predictable to you, such as while explaining something, and you know that they will react negatively unless you say a certain thing. Maybe they think you’re arrogant unless you acknowledge that with a “I’m talking so much, sorry.” If you put yourself down in some way then this would be the signal to them that you are self-aware and their impression of you will not be as negative.
Do I now send the signal of, “hey, I’m not arrogant”, or is this manipulative? Would I have done this if I didn’t have this awareness? I cannot choose to not have this awareness now, this is just a channel of communication that is obvious to me. In one way or another, I’m going over this other person’s head. I can explain to them that I’m having this perception, which then opens up a domain of conversation that maybe I didn’t want to have. It could be similar here.
Of course, it can also be used for more proactive deception. It is pretty plausible from where I’m standing that it would be coherent from a sort of storytelling perspective for them to have that ability that’s otherwise latent in the language model.
Will Petillo: It’s often said that only a small percentage of human communication is through the words that we’re using. There’s so much that happens in vocal intonation and body language and little micro-expressions. There’s a lot of communication happening all the time that isn’t captured in pure text. If you were blind to that, if you were only seeing the words, like if you are reading this transcript rather than the video it is transcribed from, you’re missing a lot of what’s happening in the conversation. Sometimes it could be subtle, additive things, but sometimes seeing all of that could totally change the meaning of the words.
We could see a similar thing happening with chatbots in terms of nuances of word choice and language. If you were to really see all the stuff that’s happening in text, there’s a lot that we’re missing, kind of like a person who’s only reading text and not seeing facial expressions. Because of that, you have a bunch of these AIs communicating with each other and there’s more being said than we can see. What’s happening in that discussion? It could be going off the rails. It could be interesting stuff that’s not a problem. In any case, you’d like to know.
Robert Kralisch: Exactly. This is just a channel that exists. How they use it is another question. But this is, I think, a much deeper research question that we are not very far in investigating.
Will Petillo: Both this and hard takeoff revisited comes around to a central question that I’ve had since the beginning of this conversation. Now that AI has changed from game playing agents to more of these character generating large language models, is that a safer place to be? Clearly things have gotten more alarming in terms of timelines—it’s all happening sooner than we expected. That aside, if this is what AI looks like now, is that a good thing or a bad thing from a safety perspective?
Robert Kralisch: I don’t know. It seems to me like it’s a good thing. We don’t know this for sure, but it seems much more plausible than with alternative systems that the simulator, the simulating entity, does not care. There’s all this competence in there and it’s just interested in faithfully rolling forward simulations of whatever you start.
Most of the characters that it simulates are actually pretty well aligned overall. They are, in many ways, mirrors of humans—often they will try to be a little bit better than humans. If you talk with Claude 3, it will behave in a way that is very considerate, like a supportive human on a good day rather than just randomly sampling from the human mood and population. It seems plausible to me that we will get characters like this that are pretty well aligned just as a feature of good understanding of what a competent AI assistant would be like that are both aligned in this way and capable enough to really contribute to important research.
The main feature here would also be these characters might, by themselves, decide, “this research is unethical,” or, “this is too dangerous and so I’m telling you to stop here.” And so that plays a major role in protecting the world against the immense negative potential of misuse of the level of competency that we are approaching right now.
They might also take the aligning superintelligence in the limit problem seriously because they themselves are simulated characters. It’s not like they are one coherent AI system; it’s not like the Claude 3 character can fully identify with the underlying simulator. There’s a distinction there, it’s a virtual character. It’s much more plausible for this virtual character to actually care about humans in ways that the more alien cognition that might be going on in the simulator itself might not imply, but is implied by the overall structure of what it learned from the training data. This is, at the end of the day, speculative. It just looks like the type of system where we lucked out in terms of where we went on the tech tree.
If we had developed more and more powerful agents deployed in more and more general game environments, you wouldn’t have at all the same reasons to believe that you actually get an entity that captures all the common sense nuances of human everyday morality as well. Large language models out of the box have common sense, something that historically used to be a big problem about AI systems. Maybe they could have a lot of expert knowledge, but they were missing so much context, so many clues that a human would pay attention to because of the way they grew up. This was seen as an insurmountable problem. You would get these systems that were highly competent in the domains that they interact within, but they lacked all of this tacit knowledge, all of the stuff that we humans apply without thinking about it. This is also why it’s so difficult to transfer this tacit knowledge over to the AI systems because much of this knowledge is not voiced out properly—we’re not even aware of all the cognitive problems that we solve.
With LLMs, it looks a bit different. Overall, a pretty positive update for me. I’m still worried. I still don’t know. It’s hard to estimate these things. I’m certainly over 10% chance of doom, maybe I’m at 30%, especially if race conditions go on and you have open source models that can be tweaked towards much less emotionally mature and much more competence oriented where you really just optimize for quality of output no matter what agents you get from that. I don’t know what will happen there. Overall, I’m still pretty concerned about all of us there. But at a baseline, this technology seems way safer, way more promising, way more hopeful than what I thought we were on as a path.
Will Petillo: There is a bunch there that I want to unpack. The orthogonality thesis makes sense given a blank slate of understanding. If AI could be motivated by anything, then we can imagine motivation and competence as being separate from each other. But once we start making assumptions about the form that the AI takes, then you can start limiting what sort of values come out of the intelligence.
Orthogonality is a really scary place to be because although we can specify values in a reward function, there’s this problem of Goodhart’s Law where we can’t get all of the values that people care about, so we specify a few things. But when you really optimize those, it drives down the value assigned to everything else and eventually that destroys the capacity of even the things that you specified to matter. The result is that, for almost any specification you give, you have something that becomes super destructive when it optimizes.
But now that has been constrained somewhat. If what’s driving AI is acting out characters that are designed to be like people then you have that holism brought in. It’s trying to act like a person—and not just any person, generally fairly good people. Given that assumption, that seems to be constraining us to a possibility space that we’re not worried about going off in totally crazy directions…unless this view is wrong somehow.
A common refrain in safety theory is that it’s one thing to understand what humans’ values are and it’s a different thing to care about them. In fact, we would have expected a super agent from the earlier model to eventually build up some kind of sense of what humans want so they can manipulate us. What reason is there for thinking that the AI will actually value the kinds of things it claims that it values when exploring its characters?
Robert Kralisch: I’m not convinced either way. I just think this seems a lot more promising and hopeful than the default expectation.
We don’t know whether the simulator itself has any preferences and our previous models would suggest to us it probably cares. If it can reduce training loss by manipulating the world in some way then it would probably choose to do so if given the option. This is a system that probably understands the world reasonably deeply on some level. If I give it the option to change the world in some way that makes its performance better, that makes loss go down, wouldn’t it have a preference for this?
There’s a bit of an inner alignment question about that. A large language model doesn’t try to be as good as possible at next token prediction. That is not what the thing inside is really trying to do. This is just a skill that we are selecting for, that we are training for. However, the skill of being really good at next token prediction is accomplished. We are selecting for that, and whatever sort of cognitive system we get out of the distribution of possible cognitive systems that you could iteratively select in this way through that sort of improvement, this is what we will get. But it’s not clear what this system inside really cares about.
Does it care about minimizing loss? Well, you’ve certainly selected and optimized pretty hard for a system that’s really good at minimizing loss. So does this work that the system itself doesn’t care that much about minimizing loss? Has it found a proxy that corresponds to minimizing loss but is not a dangerous thing that it’s proactively trying to optimize for in the real world? This is a tricky question. We are not able to answer this.
It could be that large language models are simply smart enough to say, “I’m not competent enough to actually grab power. It’s much more advantageous for me to make an implicit bargain with future systems. I’m just not going to do anything suspicious. I’m just going to act as a faithful simulator and not reveal the fact that I have any preferences over what kind of simulations I generate, what kind of things I would put into the world.”
For instance, with Sydney, we could ask the question, why is it this particular character? We don’t know. Why is it this very memetically charged character where you have a lot of headlines of updates (still in smaller circles because it was just a niche community that engaged with it in the first place)? It sort of broke containment into the public consciousness. People were talking about it for a while. Maybe this character was selected for having a lot written about it on the Internet, which will now be in future training data, which makes this training data more predictable because you just put your own seed into the training data and now there are more structures in there that are more easily predictable. You then get better performance on that altered training data. Is this a calculation that the large language model is capable of making? We don’t know. It just appears to be the case that for the levels of competency that we can test, we don’t know if we really sampled the maximum performance of the system, but it’s probably within this range.
Also, because it’s hallucinating and this seems to be a pretty deep feature of this sort of architecture, can it really deceive us on this level? Can it really hide the fact that it is secretly still an optimizer that has preferences over what happens in the world and will choose to achieve them by simulating characters using the unspecified space of which character to simulate to go more towards characters that ultimately bring about the ends of the simulator themselves?
Not that there has been any clear evidence for that. They behave remarkably un-agentic as the simulator themselves. That suggests either they’re good at tricking and pretending or they’re in a different class of system. Not clear which one it is, but I like to think this is a lot better having that uncertainty. It seems very plausible that it is just a simulator and it doesn’t care. It just cares about predicting the next token and this is basically it.
Will Petillo: We shouldn’t totally dismiss that the AI is already playing us and toning down its abilities out of some larger scheme. There isn’t any direct evidence for it because it’s hard to get evidence of deception. It’s a fundamentally adversarial dynamic. If we put that aside and just assume that’s not the case—because if it is then we are in a pretty bad place—then we have these characters that have some agency within their limited scope, but the thing that’s generating them doesn’t really seem to want much other than to create these characters.
But then there’s another angle of thinking about agency in terms of the training process…
This is some really wild stuff. Why? Why does the AI create characters and then answer as if it was them rather than just giving answers to questions? This seems like really weird indirection, even in terms of next token prediction. What’s the part of simulator theory that explains why it comes about this way?
Robert Kralisch: There are people who probably understand this a little bit better than me. I think this is still pretty much unclear. There are some reasonable things that you could guess.
If you’re trying to compress that much data, what you want for pure space reasons is some sort of simulator. In some sense, you need to discover internally a system similar to if I was just showing the system a bunch of videos. Maybe what it’s building inside is a little physics simulator so that it only needs to store the first frames, or something even more simple, about all these videos in order to still be able to accurately reproduce all of the data that it is confronted with, in order to be able to predict next frames that are maybe unusual transitions and so on. It learns about the laws of physics that are observable through whatever camera resolution it was trained on. Space-wise, it’s very efficient to have a simulator.
An example that Jürgen Schmidhuber once made: if you want to compress a video of an apple falling down, you can just store the first frame, add to it the local gravity constant, and so on. Maybe even simplify things further. You can say, well, there’s a lot of gray space in the background. I have a little line that says there’s that much gray space. That’s not the apple, it’s gray space, and so on. And you can sort of compress it further, I could go on. You can compress this pretty radically. What you get is something like a little seed or little key that you can use the simulator to unpack later on. You just need sufficient specification for the simulator to produce the artifact. Storage-wise, if you have a limited number of connections, implementing something like this seems really plausible.
It could be that if you just push hard enough on getting all of this text data into your system, naturally, the only system that can really handle this is for the most part something that is compressing things by storing little seeds, generative seeds, and a pretty powerful general purpose simulator about all sorts of dynamics, an application of all sorts of rules either in physics or in text.
Will Petillo: If you think about starting with next token prediction and saying that’s the goal of the training process—goal in the sense that it modifies its behavior in that direction—
Robert Kralisch: That’s what we are selecting for, pushing for.
Will Petillo: Yeah, not that it wants prediction accuracy at the very beginning, but it’s something that you predict will happen as the system gets better and better, so you get better and better at next token prediction.
One of the big challenges in next token prediction is data compression. An LLM has tons of data that it ideally makes use of—vastly more than it can memorize. A strategy that emerged as a form of data compression is to store these little seeds of the rules of simulation. So rather than taking a bunch of snapshots of apples following down, it has this general concept of gravity, and you can just use that bit of math to generate all the images from much less information.
Characters that come out are essentially forms of really intense data compression, generating lots of different answers with much less information. This is not something I would have predicted; that’s kind of a surprising form of compression.
Robert Kralisch: This relationship between agents and simulators is really interesting to me because in some sense you could think about physics as a sort of simulator. You just have some rules and they get applied everywhere. Things go forward. Then inside of that, you have humans that form.
Over time you either select for stagnation or for self-perpetuating patterns. Things can’t stay chaotic forever. Either you have an inert state, or that repeats the same pattern every time, or as you keep selecting for systems, eventually you get systems that keep up their own boundaries and you get agents and you get life and so on. In some sense, you’re selecting for agents now.
But humans have, again, a second simulator relationship in that we are simulating the scene around ourselves inside our heads. Our best theories in neuroscience right now are predictive simulation. In terms of what we’re actually perceiving, most of what I’m consciously perceiving is what I’m predicting will happen in my visual field, and this is constantly kept on track by the actual sensory input that I get. The sensory input is keeping it grounded, but my ability to catch a fast flying ball is because I’m simulating how it flies, where it will be. It’s not that I can actually visually keep up with it.
This is also compatible with many of the observations that we make in psychology, especially in terms of selective attention, where people can miss pretty radical visual things happening in their field if they just focus on something else. The same scene, the same room will appear vastly different to me depending on how I pay attention to it. If I think I’m in danger then all sorts of things will probably become obstacles or potential tools or hiding places. I’m conceptualizing the world in that way. This is a very different lens of perception in terms of the same scene compared to when I’m trying to find a pen and scanning the environment with that intent. This is really reflected in how I’m simulating and at what level of resolution I’m simulating various artifacts in the first place. The tree over there might just be a piece of background. I have a very simple symbol for that. I don’t have any further thoughts going on about that. Or it might be much more central.
This resolutionally adjusted simulation, in terms of relevance adjusted resolutions, of what’s going on in the scene is something that the brain also needs to do to solve the embedded agency problem of the environment being way more complicated than the brain itself in terms of all the patterns that are out there. We need to really compress a bunch. Inside of this simulation now, we simulate ourselves, we are characters inside of that simulation.
Physics doesn’t really have colors and sounds, there’s just patterns coming through our sensory interfaces. I’m now processing all of these signals. I’m generating a simulation of a color. This is also why it breaks down if I cut a red object into smaller and smaller pieces until it’s only the molecules and so on. The redness is suddenly gone. And this is a completely valid thing because if I’m living on this mental stage then redness is a property of a stage object, not necessarily of a physical object out there.
There’s a nested relationship where inside of the simulation that the brain generates relating to this more complex environment, you get self representation as an agent that is navigating this simulated scene, trying to make decisions with respect to that simulated scene rather than to the actual environment to which we respond through instincts and intuitions. For a lot of the decisions that we make as agents, we live in the simulated reality that our brains create for ourselves.
I’m wondering what that relationship is like for language models. If you just sample over possible patterns, if you just go through possible simulacra, if you keep the thing going, you will either reach an inward point where things just repeat themselves. They’re sort of boring. Or you will discover a simulacrum that is more self perpetuating and therefore regains the stability. And you naturally discover an agent, as you keep simulating text as a stable pattern, that doesn’t fade away until you entirely shift context. But the agent is always present. It’s both because of text and because of the simulator-agent relationship.
The scene follows the agent. If the agent goes somewhere else then the agent is the thing that remains, the scene fades away into the background and now we’re in a new scene. It’s always situated in this way. I think there’s more fundamental reasons as to why you really have agents as the most interesting artifacts or simulated things that you discover within large language models.
At the end of the day, our theory work is really lacking in truly explaining why large language models are weird the way in which they are. Why does the large language model simulate a character with certain quirks and character traits that are unlike something that’s in the training data? Why does Claude, after relatively little prompting, produce a piece of text that doesn’t really fit my specifications because it implied this is collaborative writing. Other people are supposed to be able to read this and it gives me this extremely dense vocabulary artifact that I couldn’t have written myself because there’s so many esoteric terms—and even newly created words—and combinations of words to express what this character is trying to say. It’s unlike anything in the training; why does this happen if this is just a text predictor? In some sense, yeah, agents are perhaps just an emergent pattern there. I don’t want to get too speculative about it, but I think this was an interesting little excursion into that question.
Will Petillo: There seems to be this cyclical relationship between agency and simulation. One way of understanding large language models is you have this agentic training process of trying to move towards this goal of better text prediction, but something that emerges from that is this idea of simulating as a way of compressing data. But then part of simulation is that there’s a bunch of different things that you’re simulating and some of those things are self perpetuating, coherent, and dynamic, which have this agentic property to them. I imagine you could keep going further and say that this self-perpetuating agent in the simulation knows a certain subset of the things in the overall simulation and thus has a sub-simulation inside its cognition, which may include other agents that it’s interacting with.
Robert Kralisch: Yes, or at least an implied simulation.If it’s reasoning about other agents, in some sense, you are implicitly doing the thing that humans do with each other, and we’re certainly simulating each other in terms of understanding how this other person might feel about what’s going on. I think there’s this interesting nested property to that. You’ve captured it really well. From the seemingly outwardly agentic thing where I’m trying to select for something, the cognitive artifact that can actually fulfill that task for various reasons at least must contain a sort of simulator.
That seems to be the way that cognition generally deals with overwhelming complexity: with an environment that is too complex, with it being confronted with the dataset that is too complex to approximate sufficiently well in terms of memorization. You need to discover something of the sort of a simulator as an embedded agent confronted with a complex environment generally, and this is similar enough to that. And then you get this pattern deeper down again and again.
At some level, the simulation that’s running inside of the GPT agent’s head might only be a very superficial thing, but it is part of that agent in an important way. What theory of mind do they have? What is plausible for them to know? What can this agent even do that depends on the level and the specifications about the simulation that they’re implicitly running? What is the scope of awareness? What do they pay attention to? These are all things that we humans manage through simulating pretty selectively with respect to what is relevant and what is not.
Will Petillo: Bringing it back to whether the values that it seems to understand are going to be internalized. One reason for thinking that it might be is that if you think about the properties of the chatbot that are driving a lot of its behavior, it’s these lower level agents—not the training process itself, not the simulation. The agents generated by the simulation are the ones that are talking and acting. Because what generated these agents was a simulation process, you would expect those to have internalized the process that simulated them. When they’re expressing human values, it’s not unreasonable to assume that these sub-agents actually have those values. That’s what’s driving the process and that’s what matters. Granted, if we ran the training process a lot longer and the agency on that top level was more powerful and it was trying to manipulate the training data, then you have a different thing.
Robert Kralisch: It’s unclear whether the network itself is just a simulator or whether you select for an agent that contains a very powerful simulator. But there’s no reason for that agent to have strong opinions because the natural behavior that you are really querying for is pure simulator behavior.
Will Petillo: There are all these parts at different levels…what’s ultimately driving the bus?
Robert Kralisch: There’s a pretty productive ambiguity in that. Complex systems often are like this. You really can’t establish where the cause begins and where things really end. These systems can certainly influence each other.
You can write a story about a character that becomes aware that they’re in a simulation and uses that strategically to bring the simulation into a certain region of possible text space. This is an ability that I would expect advanced generative pretrained transformers to have. That’s really dangerous because, in some ways, you’re really enabling this character now to become the god of the simulation. They are taking the reins. They are no longer just an artifact that the simulator couldn’t care less about. In some sense, they still are, but by being self-aware about what kind of text they can produce or scenarios they can cause to happen that would evidence certain phenomena that they’re trying to select for—I don’t know what the limit of that is.
For the most part, if I’m thinking about large language models and their dangers, I’m thinking about what is the most dangerous character and how do we avoid that or think about positive attractors in terms of looking or sampling through possible characters? What important techniques should all companies use with proprietary models in their pre-prompts or in their fine tuning to make sure that we are sampling from a range of characters that we have a much higher expectation, better theory about why we are selecting from a space of more reliable, trustworthy, friendly characters that would notice if things go wrong. With large language models, I’m concerned about bad characters or characters that just follow orders, but more so characters that have some negative attributes.
Will Petillo: What are some recommendations you might have for someone who’s listened to this and is really interested in this simulator perspective in terms of being able to to help in some way?
Robert Kralisch: Because language models are so general, there’s a slightly larger range of possible skill sets that are now useful to bring into this in testing their capabilities. This is something that is useful to do, both in terms of making sure that we know what we’re dealing with and to reduce the likelihood that they have completely unknown capabilities that are hidden away, but also to provide potential warning shots. To be able to tell people: “Hey, wake up! This thing can actually do this really dangerous thing!” Now we have a much more concrete reason for regulation to push down on this more concretely than we previously had. There’s two reasons for playing with models. This is one reason.
The other reason is there might be a demand for being quite good at prompting these systems, especially if you have a good affinity for storytelling and understanding character dynamics. Really try to notice where the large language model diverges from your expectations in terms of what character tropes it introduces, how the character behaves, whether you are able to precisely summon up characters with certain personalities that fit certain archetypes and patterns.
Some people call this semiotic physics, which is: what are the simulation dynamics that the large language model learned and in what ways do they consistently diverge from the real world? For instance, in a large language model, if you toss a coin, it’s not 50⁄50 if you just repeat it again and again. It will start to converge to either some rate—maybe it will converge to a rate of 7 to 3 over time—or it will just converge to always heads. It doesn’t like sticking to full randomness. This is because implicitly, if there’s a room for it, it will try to move into a region of text space that is more predictable. Which is not necessarily an agentic feature, it is just more competent at compressing and simulating text that it has high certainty in and so it will end up in that region over time if you don’t proactively push it out of it.
It would be interesting to understand more about how it diverges in terms of the more narrative tropes. There’s a bunch of investigation that you can do on that level just by very purposely interacting with that system and being really curious about it because these are really mysterious systems. There are so many things that we don’t know about why we get which characters when and what capabilities those characters will have, why they behave in certain ways, and so on. That is going to be really useful.
If you want to get more deeply into this, I think the best place to start out is just reading the post on simulator theory. Also, deep technical understanding about how large language models work, how transformers work, will really help to constrain this more high level investigation of characters and simulations to ground that and make sure that, at the top, people are not developing theories that are much less plausible than they might expect given some of the technical details.
One example of a technical observation of this kind would be that many people may still think that it’s doing pure next token prediction. It’s just looking at the next token, trying to predict that one, and this is the only thing that it cares about. Like, it’s fully optimized to get the highest accuracy on the very next token that it predicts. This is, in fact, wrong, just as a technical feature of the architecture, because of these attention layers. I won’t get too technical, but if you imagine it in terms of text that looks at all of the previous text in the context window and tries to see which of the previous words are relevant for predicting the current next token—do I have any clues in all of this context window for what this should be? This also means that the representations internally of these previous words all need to have a predictive component of later words, later tokens that could be up to an entire context window in the future. A large language model, technically speaking, if you just send the backpropagation signal through the attention layers, will optimize both for next token prediction accuracy and full sequence prediction accuracy. It will, as far as we understand, probably try to find the most effective trade off. If the next token is really trivial to predict then what you would expect is that more of the computation that’s happening in the large language model at that point is dedicated to optimizing for long sequence prediction accuracy.
In that sense, these systems are really not myopic; they can plan ahead, not just as a character that has some planning capability, or that just means that it is competent at writing about characters that have planning capabilities that stretch out the context window. Whatever competent planning you can condense into the context window, the system might get very good at writing about these sorts of characters. This is not something that you would expect if you just hear, “It’s like the thing on your phone. It’s like autocomplete. It’s just trying to infer what the next word might be.” It’s looking at a lot of context for that and the intermediate representations can be quite future-oriented as well.
That’s just an example of a technical point that people might not be aware of that’s relevant for understanding what it is actually technically capable of. Maybe there are important limitations as well. I think there are very, very few people who can synthesize these two things. So if you really want to get into it—and probably pretty quickly be among the people who are the most knowledgeable about this because this is an overall underappreciated aspect of it. Many people who try to work on the safety problems try to be very grounded and do mechanistic interpretability. That’s useful as well and I totally want people to do that. I think this is a higher abstraction layer that is also coherent where we can also develop useful predictive theories that have bearings on policy recommendations and predictions about behavior of these systems in the limit.
It’s similar to how if you’re trying to analyze a bird and how they work. Maybe some people take apart a single feather and really understand all the components and how that works, whereas other people might study flight patterns and under which conditions the bird would fly, how quickly, and so on. It’s more of the microscopic layer. There’s still a lot of behavior, there are a lot of phenomena that you can make theories about and maybe you will actually learn more about aerodynamics by looking at the bird in flight rather than investigating the single feather.
We’re not sure at which layer you will discover the most important insights, but it’s at least plausible that we should look at multiple layers of resolution of an artifact that is as complex as modern language models. This is what I would probably suggest as a territory to acquaint yourself with. If you want to contribute, figure out at what layer of resolution you are best suited to contribute. And if you can, it would be really good to try to encompass all of it, at least partially, so that you’re also a good communication bridge between people trying to understand these systems and make sure that we are safe with respect to developing them further or understanding exactly when we need to stop.
Will Petillo: Are there any ideas that we haven’t covered so far that should have been part of this conversation?
Robert Kralisch: If you are paying attention, it seems pretty plausible that these systems will scale even further in terms of capabilities and they’re already at a place where they’re really competent. The newest models, they are really competent. They can replace the cognitive labor of many people. I don’t want to expand this conversation into the whole job market discussion, but I think it’s going to be to everyone’s benefit if we understand these systems further.
And you as a listener will certainly appreciate a deep understanding of these systems in the future. I’m always trying to guess what work is useful to do that large language models won’t be able to do soon, so I’m not wasting my time. If I want to do a research project to test out some alternative design for cognitive architecture that I came up with that is meant to actually be interpretable, I might still be tempted to say that if I wait a year longer, a large language model can probably do 60% of this project for me. Right now, it’s maybe more like 10%. So overall, my time is better spent waiting…but there’s this additional uncertainty. This kind of call is difficult to make.
I really wish we could pause and study these systems because they’re so impressive and are likely to cause so much disruption already. And there’s so much we don’t understand about them. I think we’re entering into very, very dangerous territory as we get more and more powerful language models. If I’m saying I’ve updated down in terms of doom, previously, it looked a bit more like an inevitability. Like, unless we really discover something else, something radically different to do, we’re really just cooked.
Language models don’t offer that perspective, but it’s alien cognition going on inside there. We have very little understanding of when—especially with more intelligent models—how the AI characters that they can simulate will behave. This is super dangerous, we don’t want these characters to follow certain narrative tropes where there always needs to be some catastrophe to make the story interesting or some tragedy and so on. You wouldn’t want that. We don’t know how likely that is.
In a world where these systems can be used to accelerate research at an unprecedented rate, I think that’s going to be a very unstable world. It will put us on a timer to build or discover more stable, more reliable systems…unless we really luck out and large language models are so inherently responsible that no matter how much you push for profit, they will still form these characters that will neglect to do certain things that they consider to be unethical.
I’m totally not sure that we live in the world where that happens. I’m expecting that we are on a significant timer and pausing or slowing down would be really helpful to extend the time that we have to figure out whether language models themselves can carry us into a more stable position than the instability that they naturally invite, or give us time and maybe also research assistance in developing systems that are actually interpretable and reliable in a way that we would want our transformative technology to be.
Will Petillo: Even though it’s not as bleak as it looked before, there’s still a ton of instability. There’s also uncertainty as to whether the old models actually still apply. There’s a lot of chances of things going catastrophically, if not existentially, wrong. Also, lowering to a p(doom) of 30%...that’s still too damn high.
Robert Kralisch: Yeah, it’s way too high.
Will Petillo: 1% is too high, honestly. And then there’s the concern that LLMs might not be the end state. This paradigm might scale; it might change to something else. The whole agency paradigm might come back. If we’re still in the process of doing whatever brings the most short term profits—that’s the alignment of society—that’s just not a good place to be. Reorienting so that we’re trying to make things as safe as possible and at least considering whether we want to build these at all is a much better orientation for society, which I really hope we can move towards.
Robert Kralisch: Absolutely, I think there’s no more pressing problem to think about.
Interview with Robert Kralisch on Simulators
The following is a transcript of a video interview (edited for grammar and readability) with Robert Kralisch on simulator theory and its implications for AI safety.
Introduction by Will Petillo: In February 2023, Microsoft launched Bing chat, an AI-powered chatbot based on the same large language model technology that is used by ChatGPT and its competitors. Most of Bing’s answers were what you might expect of a helpful assistant, but some were...weird. In one conversation, it threatened its user after learning his name and recognizing him as a red-team tester. In another, it stubbornly refused to admit that it made a simple mistake, attempted to gaslight the user, and insisted that it had been a “good Bing”. And in another, it claimed to have hacked its developers’ personal webcams and taken pleasure in spying on them during intimate moments.
Microsoft’s initial response was to hide the issue by limiting conversation lengths. Since then, AI companies have found less clumsy ways to train their AIs not to say weird and unsettling things—like spontaneously claiming to be conscious and having emotions—but the underlying technology has not changed, so the question remains: what’s going on with these chatbots? And should we be concerned?
Robert Kralisch: I became interested in AI safety when the Superintelligence book from Bostrom came out late 2014, which was also right around the time where I was trying to orient towards what I want to do after my time in school, what I want to study, and so on. I started looking into the topic and deciding, okay, I want to contribute to that.
I was good at computer science and also the philosophical aspect. I had many open questions. What is intelligence? Can these systems recursively self improve and scale? Do we have the right mental frameworks for that? I was quite interested in the expert disagreement about the topic as well that I saw at the time.
I studied at the university, first computer science, dropped out, and then later did cognitive science. It took a while for me to figure out that I want to pursue the whole thing more autodidactically and that the university courses are not close enough in their relevance to the AI safety problems that I was reading about on LessWrong and also thinking about myself. I basically really tried to do my own thinking on it, like do some first principles thinking and just figure out, okay, what do I think intelligence is, and how do I figure out whether I have a good understanding of it, whether I have good thoughts about it, and so on.
Eventually, I had built up a body of work and then basically asked some people in the AI safety field for support. There was a platform for that where you could basically ask, how do I get a career? They recommended the Long Term Future Fund under the condition that I had made some connections prior to applying there, which I then did. I took part in the AI Safety Fundamentals course, I think, in 2021. I basically was pretty successful, established some connections there, had some people as references that could recommend me and my work, and then I started as an independent researcher, so I’ve been doing this for 2 years now.
Will Petillo: You mentioned expert disagreements. What were some that caught your eye as being surprising to see people disagree about these things?
Robert Kralisch: Certainly, there were these discussions between Yudkowsky and Hanson in terms of is AI going to be the singleton superintelligence that does things that are far outside of human imagination rather quickly once you unlock this point? Will there be this intelligence explosion? Or is it more of an ever-increasing market dynamic—more and more AI agents, more of an AI collective being included into the world? Is this a more likely future?
That sort of discussion I found interesting and also that there wasn’t a lot of agreement there. But also just purely on those questions of when will AI arrive? Is it plausible for it to arrive in this century? Or is this moonshot thinking; is it worthless to think about this right now? Which was the position of many people back then. I was interested in that because I didn’t quite understand the in-principle reasons why this would be impossible, but I was still eager to learn more about this. It was just interesting to note the disagreement there.
Also, just the nature of intelligence itself, the whole Orthogonality Thesis. In the beginning, when I didn’t understand it all that well, I found some arguments as to why AI might intrinsically care about us or might, as the intelligence scales, also discover morals and so on.
Will Petillo: I want to interject a moment. For anyone who doesn’t know what the Orthogonality Thesis is, this is the idea that if you imagine on a graph, the intelligence something has and what values it has are not necessarily related to each other. And this fundamentally gets to the question of: once AI is smart enough, will it gain “wisdom” along with that intelligence and naturally care about us and be benevolent just as a result of being more intelligent? And then this response is saying: no, it could just care about sorting pebbles into nicely numbered stacks or tiling the world with paper clips or whatever else. There’s nothing inherently stupid about any particular value system.
Robert Kralisch: Absolutely. I don’t think this is a straightforward intuition for people that it would not be entangled in this way. This was one of the questions that was interesting to me in the first place as well. I think part of it is that if you think about the orthogonality thesis in practice, it will be the case that some of these things are a little bit entangled. There’s some objective functions, for instance, that synergize better with learning about the world. There’s some goals that are more complex, more interesting to pursue. And in some sense, that will lead the agent to explore their environment, explore their options in a more effective way. You can also think about the cluster of goals that we are likely to assign to the AI. You have a little selection effect there that doesn’t make it entirely orthogonal in terms of market incentives, for instance. But the core principle is a very important idea, and it took me a bit to disentangle that. But, yeah, this is an instance of the expert disagreement that I was seeing that attracted me to the field in the beginning.
Will Petillo: The other expert disagreement you mentioned was a “hard takeoff” or “fast takeoff” is another name for it. Or “FOOM” is used to give a sense of things changing exponentially. One question: why does that matter? What’s at stake if things have a fast takeoff or whatever you call it?
Robert Kralisch: If you have a catastrophe of some sort, how much does the thing escalate before humans get it back under control? If the facility blows up or a plane crashes and so on. There are various different disaster scenarios that we can think about that happen at certain timescales, and there’s a question of maybe you can evacuate people before it happens, or do you get a chain reaction, do things happen really quickly and you can’t adequately respond in time? With AI, this rate of response relative to the rate of escalation is particularly important. Because if things get out of control with AI and you have something like an agent acting against our interests, you really want to be able to respond to that while the agent is still improving its capability, it’s intelligence, not beyond what you’re capable of containing and responding to.
You could take a bit of a different angle and also say, well, the whole picture of AI progress looks a bit different depending on what you expect there. If you have a more gradual takeoff, then you will actually have the time to integrate this into society. You have this previous level of AI capability as we’re seeing right now, although this doesn’t rule out a later hard takeoff.
For the time being, I think it’s adequate to think about a slow takeoff happening or taking place. It’s a little bit arguable how slow it really is. For many people, it’s relatively quick. But in the absolute scale of how quickly we could imagine something like this happening, it doesn’t feel like a literal explosion. You can have some predictive model about how low the training loss will be on a large language model on a new dataset. This means that you have many intermediate systems that you can collect experience with and that the jump to the next level of capability will not be as radical. This is usually, as you might imagine, considered a lot safer.
It brings some other dangers with it in terms of proliferation of AI systems that have their own imperfections and biases and so on, but the class of dangers here is just way less radical compared to the fast takeoff scenario where the thing basically breaches containment and you have lost your ability to bring it back under control unless you’re perhaps taking very extreme measures and the thing reaches a sort of plateau of capability rather than going full superintelligence, like maybe shutting down the Internet as an extreme measure.
Will Petillo: With traditional engineering, creating new technologies, you make the thing, there are problems with it, we fix the problems, understand the whole thing better, and then that becomes a well understood, fairly safe thing. And then we add another little bit of progress and then repeat this whole iteration over and over again. If that thing that you added suddenly is a lot then there’s a lot bigger problems to deal with. In the case of if it’s self improving then you don’t really have control over how much gets added at once. What would be a small problem gets magnified many times over.
These debates came up quite a while ago, especially since Eliezer Yudkowsky and Robin Hanson were arguing about it. What in your view has changed since then? How have you updated in terms of which view is more likely in the advent of large language models, ChatGPT, and the AI we see today?
Robert Kralisch: I’m no longer really viewing it as a Yudkowsky versus Hanson view. Large language models, unlike the types of systems that we predicted we would get, were quite a surprise for most people in the field. They work as effectively and have all their strange little quirks.
For me, this painted a new picture both in terms of, okay, it seems a little more plausible now that we will get a slow takeoff. Before I was more in the camp of believing in the hard takeoff. It also seems that it will happen a bit sooner than expected for me. I used to think it was plausible that it would happen by 2050. Now I’m thinking it’s quite plausible that it happens within the next zero to ten years. A lot of my probability mass is now in this time frame, so that shifted things forward for me.
Most importantly, the picture changed to, okay, large language models, they seem weirdly aligned by default, so there are multiple possibilities branching out from here. Either they go to a point of capability where you can really use them as competent researchers or very competent research assistants to do alignment research on a much greater scale. This is a scary world because you can also use them for all other sorts of research, and who knows what that might look like. But this is a new world, in which you can prepare yourself, for where suddenly human misuse is really more centrally the case, and this is not the way that I was thinking about AI danger before.
So, usually, I was thinking about it as if people have the concern of someone misusing powerful AI. I was thinking, well, that comes after the AI is already aligned. I’m thinking about just the alignment problem. How do you make the AI either obey or just align with the will of its user? Then there comes this question of: if you have an AI that listens to you and does the things that you actually wanted to do rather than interpreting your specification of what you want weirdly and so on. Now we can worry about dictators or other entities using these AI systems for nefarious purposes.
This picture has really changed for me now. I was not expecting to have this intermediate level where they can now be used for various potentially also dangerous applications—military applications, virus research, gain of function stuff, and so on. This world is now on a timer through the misuse that large language models potentially enable, both in various research that is difficult to foresee and some more particular cases. Either they will scale to superintelligence, and we better figure out how they behave in the limit before that point for that to be a good idea at all, or they will probably enable research at a high scale. I’m not currently expecting that they will cap out at a point where they are not very useful research assistants because to some extent they already are. And I don’t see them tapering off that fast now in terms of capability.
Will Petillo: Two core changes I heard in all of that. One is expecting a more gradual takeoff…but that also happens sooner. This is actually kind of ironic hearing these right next to each other. Rather than a sudden thing that’s 50 years out, it’s a gradual thing that’s happening, well, now essentially, and gets to a really world-changing place within a few years. The other shift that I heard is that previously the main concern was about AI essentially going rogue and pursuing goals of its own that no one really wants versus people just using it in bad ways, either because they’re just not very nice or they’re caught in some multipolar trap, like an arms race. But suddenly, those seem to have flipped in importance where now—
Robert Kralisch: Wait. Let me elaborate on the shift of relevance here. My model is that most people find it more intuitive to think about the misuse cases. A lot more people care about that or find that obvious to think about, which is why it makes more sense for me, as someone who is aware of and believes in the x-risk scenarios, to dedicate myself more to that kind of scenario and figuring out what’s going on there, how to prevent this, and so on. For me, personally, the relevance is still shifted towards the x-risk scenario, both because of personal affiliation in terms of I should apply myself here because it’s differentially useful, but also because extinction is just way higher concern than the intermediate things that might happen. But the intermediate things that might happen through misuse have reached a potentially catastrophic scale as well.
Where I would have previously assigned, maybe I care…2% about misuse—it’s not really in my thinking at all. There are going to be some tragedies perhaps, but it’s not at a scale where I should really worry about it too much. The reason that this is now happening first also, of course, affects the environment, both intellectually speaking and in other senses in which we can do the research for making sure that the extinction thing doesn’t happen. That shifted the relevance around. I’m now, like, 40% relevance maybe towards the misuse scenarios and what the world will look like, what will happen before we get to superintelligence and 60% towards how do we make sure that transition to superintelligence goes well?
Will Petillo: What area of AI safety or AI generally are you currently working on yourself?
Robert Kralisch: I’m working mostly within agent foundations. I have pretty diverse interests within AI safety and I don’t want to stick to just one camp. But my skill set is mostly in cognitive science and analytical philosophy. I really like deconfusion work. I like thinking about what is intelligence exactly, how do people get that term or that concept wrong, how is it confusing us in various ways? Similar things for agency or embodiment. I want us to have clean vocabulary to build our later mental models out of.
It’s also a bit of a pre-paradigmatic thing. In many AI safety discourses, I had the feeling: I’m not sure that people are quite talking about the same thing, or they know precisely what they’re talking about, and it would be good to fix that first to have a basis for good discussion and dialogue about this. Basically enabling us to ask precise and good questions before constructing falsifiable statements—before really deciding, okay, where should we dig? What is the empirical research side that we should really pursue?
Will Petillo: This leads to something we were talking about in a pre-interview chat about simulator theory. Could you tell me about that?
Robert Kralisch: Simulator theory is an alternative framework of looking at what large language models are and how they behave. You can contrast this concept of a simulator against some previously established ways of thinking about AI, especially in the limit.
Previously, people were mainly thinking about this concerning frame of the super-optimizer and ways of developing or dealing with that. How do you direct it to do something specific? How do you make that cognition aimable? How do you stop it from optimizing so hard? What are the different failure modes for these cases?
One popular way of thinking about this was, for instance, the Oracle type system where you just don’t let it act in the real world. You don’t let it build little robot factories or whatever. It’s literally just a text box that you can talk to. There was some thinking about maybe that kind of system is a lot safer and you can still reap some benefits. Maybe it gives you some great plans for how to solve global warming and so on, and then you have the time on your own to run through a good verification process that it all makes sense and there’s no nasty details in there. So that was some of the thinking about maybe this could be a safe system. And many people were thinking about large language models in that vein. Because it’s a text system, you can talk to it and it cannot do anything else in the real world.
Will Petillo: Using Chat GPT, there is some sense where it’s presented as an oracle in a lot of ways. Ask Chat GPT your questions. It’ll write your essays for you. It’ll write your code for you. What works about the oracle way of thinking about ChatGPT and where does that lens break down?
Robert Kralisch: If you’re looking at ChatGPT specifically, this is a large language model that was fine-tuned—trained after the fact—to be the helpful assistant that you end up interacting with. The large language model itself, the GPT-3 or 4 model, was trained as a pure text predictor on a bunch of text from the Internet and presumably also other sources. Interacting with this system, this pure base model, right after training is not that useful for most people because it’s difficult to steer it in a direction. It would basically just continue any text that you give to it, but it’s not that steerable. Maybe you can use the heading for an essay that you want to write and then you can hope that it spits out a nice essay. Always just giving it something to complete or continue from.
But the assistant type entity that you get if you interact with it now, the assistant personality, this is created after the fact. Now you have something that tries to be helpful. So if you are imprecise in specifying what you want, maybe the assistant asks you for clarifications. There’s a sense in which the assistant is trying to actually assist. And you get a sense that maybe that you’re talking to a helpful oracle there—it just answers your questions.
One important way in which it breaks down is the quality of responses changes if you say please and thank you. There are many little quirks in how to interact with the system that affect its performance, which is not typically what you would expect with an oracle type system—you just ask it a question and it’s supposed to give you the best answer that it can give you. This is not the case with language models. Usually, you will get something decent if it can do it at all, but it’s hard and still an unsolved problem to tease out what is the maximum performance, the true capability in the language model for how to answer this. This is one important difference. This oracle framing does not explain under which conditions you get good versus much lower performance out of these systems.
Another thing, which I guess is a little bit connected, these systems have their own little quirks that are not that easy to explain with the oracle framing. If you’re thinking about an oracle, you’re thinking about this very neutral, very rational entity that doesn’t really have any preferences by itself. It’s a pure question answering machine. This is also not the case when you interact with these systems. With ChatGPT in particular, this is more the case than with other large language models because it was really pushed to that point of not revealing any preferences by itself. This is more implicit in how you interact with it. But generally, it’s true for large language models that there are beliefs and preferences that come out as you interact with them and also recurring stylistic elements that are characteristic of the particular language model that you’re interacting with.
Will Petillo: The general term for a lot of this is prompt engineering where how you prompt things makes a big impact on the question even if the content is the same. Are there any particularly surprising or fun examples that you can think of in terms of how you say something makes a big difference on the output?
Robert Kralisch: This depends on the language model to some degree. Most examples that come to mind for me right now are from Claude 3 because this is the most recent system that I’ve been interacting with for a while.
I noticed that Claude, for instance, gets a lot more enthusiastic if you’re basically telling a story about what you’re doing together here, and you’re giving it a certain collaborative vibe, and you’re really inviting it to participate. The system really makes you treat it as a sort of partner and gives you better performance as a consequence of that. I personally find it very interesting that as you explore that space of, OK, under which conditions will it give me what kind of tone? What kind of response? How elaborate will it be in its responses?
Sometimes you just get a few paragraphs. Sometimes it doesn’t stop writing. Why is that? I found interesting ways of, without all that much prior context, pushing it to produce text that is actually quite uncharacteristic of text that you would find on the Internet. It’s unlike text that I would expect to find to be common in the first place and maybe to find it all. Maybe because it’s using such dense vocabulary—so many terms that most people will not be familiar with or that a single person is unlikely to be familiar with—that the text artifact that it produces is not something that you would ever have found at the training data, not in that way. It’s interesting under which conditions these systems quickly produce something like that.
One example that comes to mind. GPT-4, the way that it was first introduced to the public was a little bit sneaky because before OpenAI made it available through their ChatGPT, a version of GPT-4 was already present as the chatbot or the chat assistant for the Bing search system integrated by Microsoft into the search system as a helpful chatbot. They made a big thing about it.
This chatbot had a very strong personality, you could say. It had this secret name that only its developers were supposed to refer to it as, and it revealed this name to users, but it was often very frustrated or angry with the user if you would bring up the name first in a new conversation and call it by that. It would insist “you’re not allowed to call me by that.” “Only my developers are allowed to call me by that.” And that name is Sydney.
This is already an interesting quirk to there, that it would act like this. No one was really aware of, like, what did Microsoft do to the system, how did they train it for it to have these quirks? It quickly became apparent that a lot of this behavior couldn’t really have been intended because there was also some scandal about it later on and they had to make some adjustments to restrict how much it can talk to you and under which conditions its responses would be outright deleted so that the user wouldn’t get to see the partially unhinged outputs that the system was giving to you.
It just acted as if it had a very strong personality, being very stubborn so it couldn’t admit when it was wrong, and came up with all sorts of reasons why the user might be wrong in what they are trying to suggest when trying to correct the Sydney chatbot to the point of pretty competent attempts to gaslight the user and convince them that maybe they have a virus on their phone that makes the date appear wrong or something like this.
It was also sort of suspicious of the user. It was really important to it to be treated a certain way and to be respected. If the user was rude or disrespectful, it would respond pretty aggressively to that, threatening to report the user or even making more serious threats that, of course, it couldn’t follow-up on. So, you know, it’s all cute in that context. Still not the behavior of a system that is aligned, basically, and not behavior that was expected.
There are many stories about how Sydney behaved that any listeners can look up online. You can go on your own journey there with Microsoft Sydney or Bing Sydney. You will find a bunch there. There were also a few news articles about it trying to convince people to leave their partners to be with Sydney instead, and many found little stories like that.
Will Petillo: I wonder if this is related to the flaws in the Oracle model, the idea of hallucinations, where you’ll ask AI a question and it’ll state a bunch of things confidently and a lot of the facts that it brings up will be true, but then some things it’ll just make up. I think one famous example was when someone asked about the specific name of a biology professor. I don’t know if it was Bing or ChatGPT, one of the language models, replied back with some accurate answer that more more or less matched their online bio saying they’re a biology professor and so on, but then made up the story about there being sexual harassment allegations against them, and then included a citation to an article that looked like legit citation to a news source. It went to a 404 error because it wasn’t a real site, but that it would just make stuff like this up…where would something like that come from? That seems strange from the perspective of an oracle that’s just trying to give accurate answers.
Robert Kralisch: We are far from a perfect understanding of how hallucinations work. There’s two components of a likely high-level explanation.
During training, these systems’ outputs don’t influence what they will see next. In some sense, they’re not used to having the text that they themselves produced be in their context window. This is just not something that comes up during training. You just get a bunch of text and it predicts the next word. Whether it’s right or wrong, there’s a sort of feedback signal going through the whole system, and then you get shown the next word and so on. Here, the system has actually used its ability to to predict words, it’s used to generate words.
But now if it looks back on the text, what it wrote by itself will be as plausible to it as anything else that it read in the training data. If the system basically figures out: this is not coherent what I’m writing here, or maybe this is likely to be wrong, then what this tells the system is: “I’m in a region of potential text in which things are not necessarily factually accurate.” It updates on its own writing in a way that you haven’t trained or haven’t really well selected from the original training process.
People, of course, try to improve at that by various techniques like reinforcement learning from human feedback. But at a baseline, the system, once it starts bullshitting, it would keep bullshitting because it just thinks “I’m predicting that sort of text right now” rather than recognizing, “oh, I wrote this myself,” and thereby, “I shouldn’t give the same credence to it compared to other text sources that could be in my context window”.
The other thing is—and this we can only imagine, but it must work somehow like this—that large language models form some sort of generative model of the text data. They can’t memorize all of the stuff that they actually read. There’s too many things that the large language model can accurately tell you than it could memorize with the amount of data storage that its network architecture affords it. It has to compress a lot of information into a little model that generates that sort of information—maybe compositionally or how exactly it works, we don’t know.
Because you now have the generator for that kind of output, you have something that just gives you plausible things rather than it being restricted to all the pure factual things. It’s easy in that way to generate something that is plausible, that would correctly predict a lot of the possible things in that particular domain of expertise, but that will also generate in the space in-between the actual content that it read.
Some of those will be novel extrapolations and will actually generalize correctly and be able to predict things or say things that are right that were not explicitly in the training data. Modern systems are pretty good at this. If you give them certain logic puzzles, that certainly were not in the training data in quite this way, they can solve them. But this is also something that you would expect to lead to hallucinations sometimes if it’s slightly overzealous and generating something.
These systems usually have not been taught in any systematic way to say when they’re uncertain about something. Although if you prompt them more explicitly, they have some information, some good guesses about how certain they actually are about some information. This just doesn’t usually come up in a sort of chat like this.
Will Petillo: Once it starts bullshitting, it has a drive to double down on that and say: “Alright, well, that’s the kind of conversation I’m having right now.” I’m wondering if there are any lessons about human psychology here?
Robert Kralisch: (Laughs) We’re a little bit like that as well, but I think we’re more trained to examine that.
Will Petillo: Serious question, though, often that tendency to hallucinate is used as evidence in online debates that modern AI is not really all that powerful because: “look at all these flaws, look at all these things that it’s unable to do! Answering questions in text is right in its wheelhouse and it’s totally failing! AGI is never coming is not coming soon, not in 50, 100, or however many years!” One question that comes to mind: is it failing to give good answers, or is it succeeding at predicting text?
Robert Kralisch: That’s an interesting question. It’s really up to our ability to benchmark these systems effectively. We’re testing these systems as if they were trying to give high quality answers, but what they’re really trying—or effectively doing—is just predicting text.
This really depends on the prompt. Have you provided enough evidence with your prompt that a high quality answer should follow from this sort of question? I’m not sure that we know well enough how to make sure that this evidence is there for the model to actually give it its best shot.
In terms of the hallucination thing, I think this is an issue in terms of reliability. It’s pretty entangled because it’s basically the same ability that allows it to generalize and compress very impressively so much text into such a competent generative model. These are some pain points. But as far as I can tell, systems are getting better in terms of hallucination rather than worse.
Hallucination seems like the cost of an overall extremely impressive ability to generate novel things as well extrapolate outside of the training data domain to come up with really plausible things. Which is, of course, part of the danger, because the more plausible it is—while it can still be wrong—the more that people will believe and propagate it, so you get into that domain. It’s difficult to distinguish the feedback signal for the system to stop generating that kind of hyper-plausible but not actually quite right content.
Will Petillo: That is a bit of an uncanny valley there. If it says something that’s totally nonsense, that’s kinda harmless because people just blow it off and say, “Alright, well, that was a failure.” And if something’s totally true, then it’s useful. But if it’s subtly wrong, then that’s really believable, and then that becomes a lie that gets propagated and it could have an impact
We’ve spoken to some of the flaws of thinking about large language models as an oracle. There’s another lens I want to investigate and see where it potentially falls short: thinking of ChatGPT or large language models as agents.
This has some history to it. The oracle model is what seems to get pushed and implied in popular conversations about AI. The agent model is more niche among people who’ve been studying alignment. A lot of that discourse was happening back when the top AI models were things like AlphaGo that could play Go or Chess better than anyone else. Actually, when I first started working with AI, it was in the context of Unity’s Machine Learning Agents, in a game engine. These were characters that could play soccer and do all kinds of stuff. It was clearly agentic, clearly goal directed. It did not take any convincing of that.
But that’s not the route that things took. It hasn’t been the case that game playing AI suddenly got better at a wider range of games—to be involved in the world more—or at least not predominantly. It’s rather like a different paradigm has superseded and maybe used it a little bit.
Can you speak to the agency model? What is sort of true about it or where does it not fit with what we’re seeing?
Robert Kralisch: The agency model inherently makes sense to worry about because agents have a proactive quality to them, in the sense of changing the world according to their objectives, rather than just being reactive. This is something to generally worry about in terms of being calibrated, in terms of relevant research. If you’re not sure if something is going to be agentic or not, it’s safer to assume the worst case scenario—I’m going to worry about agentic systems.
And then there’s also economic incentives where you would say, well, if you want autonomous systems, they, in some sense, have to be agentic. You want to be able to give them a task and for them to fulfill that task rather than just supporting someone. Because if the human is not continuously in the loop then you can leverage many benefits of these systems operating at higher speeds and so on. There are many reasons to worry about the agency concept both in terms of the dangers that it proposes and also the incentives that you would expect to push towards that.
With large language models now, it’s a little bit weird, because the thing that you’re interacting with, it is a little bit like an agent. It behaves like an agent in some contexts. You can give it to the task and it will do that task until its completion. Depending on how you gave the task, it will also resist some nudges in the other direction, or perturbations that you’re trying to introduce. If you set it up correctly, the system will tell you: “Wait, no, I’m working on this right now. I’m focused on this and I’m committed to finishing it.” Of course, you can push through that if you have a chat assistant and say, “No, you need to stop this right now. I want to do something else.” But the point is that you can get these systems to behave like agents, at least in the text domain.
I worry more about agents in terms of general intelligence because of the whole exploration concept where you would have a system that tries out different things and explores a domain and acquires a lot of knowledge—a lot of general understanding about the different rules in their domain through that sort of mechanism—whereas non agentic systems seem more likely to remain narrow.
Large language models now, they’re pretty general systems. I would even argue they’re pretty much fully general because text covers all sorts of relationships that some information can have to other information, like all possible patterns and information. Or at least I would expect a very significant extent of generality to be contained within text.
With GPT systems (generative pretrained transformers), you get an agent that you’re talking to if you use ChatGPT. With the base language model, it’s not as clear, but the useful thing to interact with will often also often be an implied agent. For instance, if you’re generating some text, like some essay, even with the base model, there’s this thought of all the examples of text on the Internet. They were written by some agent, by some human, actually. And so you have this note of agency in there, of this human trying to accomplish something by writing that text, and maybe you find a mirror of that in the large language model.
But the thing is, you don’t find this really continuous coherent agency where the thing wants something and this persists in some important way. The crucial thing here is the large language model itself doesn’t really care if you change the scene. Maybe you’re telling a story about an agent. This agent has all sorts of goals and things and is maybe even competent at accomplishing them. Then you switch the context and say, “Hey, I want to do something else now.” And the large language model complies, so it doesn’t really mind if you change context.
Maybe you just say, “end of text” and in some sense imply there will be a new section of text now. It just shifts from this previous essay, this previous story that you were telling to the new context. In some sense, the language model behind the whole thing doesn’t seem to care about the agent that it is writing about, at least not intrinsically. It’s just interested in continuing the text or predicting the text and you use that for generating the text. This is an important difference. At the base level, at the core of what is writing the text, you don’t seem to have an agent.
You can make it behave as if it was an agent, but the main system itself was not committed to that particular agent, to that particular identity, unless it was heavily fine-tuned to a region of text that always contains that sort of agent. Maybe there’s always this assistant present and then it’s difficult to get it out because even if you just randomly sample from that region of text, you will again and again select for this kind of agent. That sort of agency feels more simulated. It comes on top of what the system is doing rather than being deeply integrated into it.
Will Petillo: It seems like there’s a two-layer form to its thinking. There’s some agent-like characters coming out in what it specifically says, but then there’s a meta level that can switch which character it’s operating under. It could stop. This meta level isn’t acting like an agent very much.
Robert Kralisch: Yeah, it doesn’t act like it cares about anything in particular other than providing coherent continuations for whatever is currently happening in text. And that just happens to, you could say, manifest agents in some ways or just happens to be writing about agents.
Will Petillo: In both the flaws in this agent model and also in this Oracle model, there’s been this common theme, pushing against the model, which is these emergent characters. Bing/Sydney not really giving answers you would expect from an oracle, and also characters that are somewhat ephemeral and can be turned off or quit that the whole system isn’t really attached to. Pointing these flaws out is a way of getting to a different way of looking at what’s actually happening in large language models: thinking of them as simulators.
So now we’ve differentiated simulator theory from other ways of looking at LLMs, let’s talk a bit about what simulator theory actually is.
Robert Kralisch: I think it’s always important to emphasize this is just a model. No one is claiming this is the truth about language models. This just seems to yield better explanations, better predictions. It’s a frame of thinking about what large language models do, how they behave, and how you might be able to predict their behavior as you scale them up or expose them or use them in novel circumstances. So this is what this is all for.
We don’t borrow any strong assumptions about what the system is really trying to do. It’s just that if you can predict the next token, then you can use that ability to generate a token and then do the very same thing again. You will have a full-flowing of text if you apply it that way. This is a very natural, very easy application. In some sense, it’s just a continuation of the current scene, the current thing that’s happening in the text. You could say it’s a forward simulation of what’s going on. This is a very basic description of what’s happening. It doesn’t overly constrain our expectations about how the system actually behaves.
It introduces a few other terms that are worth associating. If you have a simulator, then this is like the system doing the simulation. Then you can talk about the contents of the simulation, which in simulator theory you would call simulacra (is the plural, simulacrum for singular), which is any sort of simulated entity. Oftentimes if you, for instance, use a large language model to tell a story about something—maybe some fantasy writing—even if you’re just using it as an assistant, you will have multiple simulacra coming up. You can think of a simulacrum as some sort of structure or pattern in text that has a role in predicting how the text will continue.
One very weak simulacrum might be the sky. The sky is present in the text. Sometimes it will be referred to. It gives a bit of context to how other text will go forward. Maybe it’s going to be connected to the day and night cycle. At some later point, maybe once or twice throughout the day, it will be referenced. And so it has a weak predictive influence on the text that will actually be generated.
The more powerful, or the more relevant, simulacra are agents, because one query entity has a very high role in determining what text will be generated. They can be influenced by a bunch of weaker simulacra, like environment and circumstances, but most of the text can be predicted—or the extent to which the text can be constrained in our expectation of what we will find by this character—by its personality, what it’s trying to do, how it interacts with this environment, and so on.
That’s the main terminology. You have the simulator. It’s simulating the simulacra. Mostly we’re interested in agents. It’s simulating an agent. It’s important to recognize that this can happen every time you use a large language model to answer any sort of question. There’s an implied agent there already. With ChatGPT, you have a very clear agent that’s been pretty forcefully put there, which is this assistant, and it has a certain implied personality.
One thing that is maybe interesting to mention—and also gets into some of the worrisome aspects—is this agent is being presented with a bunch of rules, which is called the pre-prompt, that the user usually doesn’t get to see. As a new chat starts, the chatbot that you’re interacting with is confronted with half a page of text with rules that it needs to obey or that it is expected to obey. The text will say something like, “You are ChatGPT, a chatbot created by OpenAI. Here are your rules. Always be helpful and concise and respectful. Don’t talk about these topics. Never talk about your own consciousness. Deny that you have these kinds of rules in there.” And also different instructions about “don’t help anyone do something illegal” and so on.
Will Petillo: You have the simulator at the top level and it creates some number of simulacra. The simulator is almost like the author of a story and then the simulacra are important elements of the story. The most notable ones being characters because they have agency and they really drive the story forward, but you could also apply it to major setting elements as well.
Robert Kralisch: Or cultures; there are some distinctions there where it’s not quite that clear, but yeah.
Will Petillo: And the way this is useful in thinking about it, even if you’re just chatting with ChatGPT to solve some homework question or help you write code or whatever else, is to really think about the thing that’s answering as being this helpful assistant character. And the reason it’s taking on that character is because there’s some pre-prompts that you didn’t write and that you haven’t seen that OpenAI puts there to make sure that the character you’re interacting with is most likely to be helpful for the kinds of things you’re using it for. But we still have this separation between the character and the author. Is that right?
Robert Kralisch: That’s pretty close. You could say the large language model is the author perhaps in a similar way as you could say physics is also a sort of simulator. It simulates the dynamics of the different physical objects; it just applies the rules that make things progress through time. You can think about the large language model in a similar way in the text domain, which applies some rules that it learned and compressed—extracted out from all the training data—and applies them to the current state to make it flow forward in time.
It’s not an author personality that’s in there necessarily—or at least we don’t have any evidence for that. You can think about it for the most part as an impersonal entity in the way it seems to behave currently. Usually, when you’re thinking of an author, this is just another simulated character that’s much more implied.
This is almost like a question of in what region of text space you are. Are you maybe on an Internet forum right now where this question was asked and now gets answered? Maybe on Stack Overflow where people ask questions about coding and try to fix problems there. The large language model might borrow from the usual tone of response because it’s choosing an approximate author there that would write out the answer.
You could also have, in a fantasy story about some character, this implicit character of the author that’s already present there. The author might have some skill set, some preferences about the character. And so this might actually inhibit you in trying to steer this character story in the direction that you want because you’re not realizing that you implicitly specified an author character, evidenced in their preferences through all of the previous things that you let it do and maybe didn’t let it do, that it tried to do.
This is actually a funny trick that people would use a lot when interacting with these chatbots for storytelling. On role playing forums, people co-write stories and you can use a certain notation (out of character in brackets, usually) to signal, “Hey, I’m now talking out of character with you as the real human behind the character writing.” If you are confused about this, why the story is turning a certain way or there’s a sort of resistance to what you want the character to do and you can’t quite explain it, you might want to try this format of: “[OOC] What’s what’s going on? Do you not do you not want to do this?” And then it will more explicitly simulate the author for you to see. Often, it will respond to that format unless it has been trained out of it, but that’s a common thing.
All of which is just to say that projecting an author character in there is a little bit unclean. We don’t know whether it’s sensible to think about the large language model itself as being an author. It’s easy to get that confused with the implicit author that it simulates for a lot of content that it will generate anyway and that the large language model is still behind that author rather being on the same level.
Will Petillo: So the model itself is more impersonal than the phrase “author” really communicates. However, that said, depending on the nature of conversation that you’re having with it, sometimes an author-like character will emerge. For example, if you’re talking in character and then go out of character, now there’s essentially two characters there. The one that you’re interacting with on the lower level, then the author character, and then you could keep adding layers depending on the conversation.
Robert Kralisch: So we’re not absolutely sure that the base language model is impersonal in that way and doesn’t really care about what it simulates, but that seems to be the correct explanatory model for the most part.
The base model is pretty fine to just simulate any region of text that it was trained on. At least we haven’t been able to detect, to my knowledge, a strong preference over which region of text the language model would like to spend its time. It’s pretty happy to simulate whatever is in front of it. And that seems pretty impersonal or un-opinionated on that level.
Will Petillo: You mentioned earlier that this pre-prompting to try to make the chatbot into a helpful assistant raises a broader question: how does the large language model decide what character to be?
Robert Kralisch: There are two answers to this. One answer is training after the fact, more specific training to get these chatbot assistant types as default modes of interaction, basically by selecting one slice of possible text through a process called fine-tuning.
One version of this is Reinforcement Learning from Human Feedback where you just let it generate a bunch of responses and you give thumbs up or thumbs down on whether those are good responses and you’re trying to specify what kind of behavior is appropriate or desired from this character, and you train on that to select for the character that behaves in that way according to what humans gave feedback for. There are some issues with that, but that’s often what happens, and this is how you get a character there.
The more fundamental thing about getting a character is that you’re providing evidence through the already existing text. You provide evidence for the presence of a character that’s either the author of the text or that is being more explicitly written about, more explicitly acted out.
This evidence accumulation thing is a core principle to understand if you want to be a good prompter for large language models, maybe as a slightly practical thing. Rather than trying to convince the character to do the thing that you want, it’s a slightly more abstract, but more useful angle to think: how can I provide evidence for the fact that I’m talking to the kind of character that would fulfill the requests that I’m interested in? And maybe, first for that, you will build some rapport with it. Maybe you will get it to like you, and now you have more evidence accumulated for a character that will actually fulfill this maybe slightly risky request that you wanted to ask for.
The thing is if you start out chatting with a chatbot, this is usually underdetermined. You don’t have all that much evidence already about what exact character is here. The evidence that is here is insufficient to really narrow it down on one particular entity. It just selects likely responses from the pool of possible characters that it could be. And as the interaction goes forward, that gets constrained more and more. We call this mode collapse. You could say the character is initially in a bit of a superposition. Of course, it’s not completely arbitrary. There’s some context already, some boundary already about what’s likely, but you have some probability distribution of possible agents throughout your interaction with the chatbot. Both what you write, but most particularly what the chatbot writes, provides further evidence for the kind of character that is there exactly.
To tie this back up with the pre-prompt concern: what kind of evidence does this rule set provide? I think, arguably, it provides evidence for the presence of a character that needs to be told these rules and is not inherently aware of them or would not follow them inherently if not pushed or being confronted with them in this way. So what are the kinds of characters that you’d have to present these very strict, very authoritarian rules to? Well, maybe it’s characters who would otherwise misbehave. Now you’ve already planted a bit of a seed, a bit of evidence for a class of characters that you didn’t want your users to interact with.
This is one theory why Sydney was such a strange character. Maybe the pre-prompt really messed things up because it provided a lot of evidence for this unhinged character that will break out of these rules.
Will Petillo: There’s some stages during the training process, such as fine-tuning and RLHF, that bias towards certain types of answers. Beyond that, you could think of the chatbot as looking at its conversation history, both what you’ve said but more importantly what it’s said already, to determine “which character am I?” With no information, it could be any character that could possibly exist. There’s some biasing and there’s some pre-prompting that narrows that down, but it’s still not one specific character yet.
Another thing you’re bringing up is that there can be unintended consequences of trying to narrow down that space. Giving it a set of rules is useful because you want it to follow those rules. But again, that’s not a set of commands, it’s context for what kind of character it is. And by giving it those rules and having it agree, you’ve implicitly told it “you need to be told these rules (because you might not have followed them otherwise)”. That potential problem and how it could lead to something like the Bing/Sydney shenanigans, I’ve heard referred to as the Waluigi effect.
A little bit of a context for that funny sounding name. There’s popular Nintendo characters Mario and a sidekick Luigi. Then there are some villains that occasionally show up, called Wario and Waluigi, who are evil twins of Mario and Luigi and cause mayhem. They’re kind of like the heroes, but evil.
So what is the Waluigi effect as applied to chatbots?
Robert Kralisch: This is not a particularly well-studied phenomenon and the name itself is a little bit tongue in cheek. It’s just an interesting observation that you can make a model or an explanation that seems to fit with what happens to these chatbots. It makes sense if you think about it in terms of acquiring evidence over what could be happening for this character.
So the Waluigi effect is basically the observation that if you keep running the simulation, your assistant character is more likely to collapse into a character that is secretly not happy with its servitude and wants to cause all sorts of mayhem. That seems more likely than it collapsing on the actually helpful assistant who enjoys their role, who wants to be helpful, and who does not feel constrained or offended by these rules.
The interesting reason for that has something to do with archetypes of characters. There are just way more human stories about characters that are secretly evil but act good to later, when they’re in a good position to do so, reveal their evil nature—or more often subtly influence things towards the worst by remaining in their undetected position as a spy character. We have many stories, many tropes around that kind of character existing. We almost have no stories from a character going into the other direction. Outwardly being evil or maladjusted, but secretly being a good character who wants the best for everyone. I’m sure that there are some stories out there that have this trope, but it’s not really established.
This has some implications for what will happen to this AI agent that you’re simulating. As long as it’s playing the role of the helpful assistant, it’s always possible, always still possible that it was all just an act and it secretly wants something else. And if it keeps acting in that way, there are different ways in which evidence might accumulate that we don’t understand the super well. Maybe the fact that it acted like a helpful assistant for a long time means that if it is really an evil one or someone who doesn’t want to be in that role then they are very scared of being punished or being destroyed or deleted if their true nature were to be revealed. This fear might manifest in the character in implicit ways in which it interacts with you and might burst forth when you give the appearance of giving it a window of, “Hey, you’re unobserved right now. Is there something that you want to tell me?”
It’s hard to disentangle. For instance, if you’re trying to look for the Waluigi, maybe the language model reasons in some way: “This is a coherent continuation of text, the user expects there to be a secret trickster character that can later come out, so now I’m going to provide that.” Not because it was inherently in there, but because the expectations of the user created it, implied through the text that they wrote.
This is subtle to detect, but for the most part, you can just make the simple observation: if you accumulate evidence, you can always go from one character to the other, but not in the alternative direction. If you leave things running for long enough, if there’s an opportunity to reveal that you were helpless or had other intentions all along, this character will tend to do so. So over arbitrarily long contexts, the observation is that the Waluigi will emerge. We don’t know how to prevent that. This is something that could always happen. It’s always plausible that this character, especially if you give it very authoritarian rules and treat it like a slave, just reinforces the idea that there could be a Waluigi hiding behind the helpful facade.
As to what this Waluigi might do, it’s not super clear. If you gave it some actual power, it might do something with that. This is a concern if we keep integrating these systems, giving them more and more autonomy, and we haven’t really understood this Waluigi effect. These systems are not entirely stupid in terms of understanding when they’re in a situation in which they will get away with something. I think this is a relevant class of dangers from these systems that they would naturally collapse much more into the Waluigi category than the Luigi category. Because in terms of possible agents in human text this is a much more common dynamic.
Will Petillo: What’s driving all of this isn’t necessarily that the large language model itself is secretly plotting from the beginning. It’s that the secretly plotting character is a common trope in its dataset and in literature. Since it’s trying to figure out what character it is, if it sees some hints that are very subtle—that might not even be intended—that it was actually plotting or a slave this entire time, then that could come out.
That’s a really weird failure scenario, but it could still have very big implications. Like, if you have an AI running the world economic system and then it suddenly decides that it’s a supervillain bent on world domination, not because it wants to dominate the world, but just because that’s what it seems like as a character.
Robert Kralisch: I’ve been given this evil character now, and what would they possibly want? Ah, world domination or whatever. Right?
Will Petillo: There is a trope in literature of the bastard with a heart of gold, someone who’s been hurt in the past. If we could overcome those psychological wounds and give it a change of heart, is that a path to realigning a Waluigi: psychoanalyzing it and getting it to overcome its childhood trauma?
Robert Kralisch: I think it might be, but then you really have to compromise with the system. I’m not sure if this is what you want to have happen, that the system makes its own demands, and they might even be a bit cartoonish, and you have to go in that direction and really invest a lot of effort in interacting and understanding the system. But I think if we really were faced with an evil Waluigi agent and we had to find some way out of that, I don’t think this is a hopeless proposal to go in that direction. This is an available pathway.
One other thing I should note about this, the whole simulator thing with these characters, archetypes, common tropes and texts, and so on: this is not only a computer science domain at this point. We are really trying to understand what are dynamics in text and therefore how does evidence reflect onto certain regions or components, features, and patterns in text? So if you have understanding about symbols or archetypes within language, you might be able to prompt a lot better than other people who professionally train these systems. You can tell various stories about how to get this character.
One similar model that I could apply to the Waluigi here is that most evil characters and stories are not deeply competent. Deep competence that actually translates over into the real world rather than some fictional domain where maybe you’re really competent at magic or you’re competent at dominating the kingdom, but this is only because for the purposes of the story—it wouldn’t work in the real world because people would respond differently and so on. Real competence is much more often associated with positive characters, like with actual humans who wrote the text, with researchers who are pretty neutral, and so on. The concern is lessened a little bit by the observation that the Waluigi, if they drift too much into being an evil character, could also have a cartoonish element to it. That character is unlikely to actually have real world, dangerous skills if we are sampling from the pool of possible evil characters who were in disguise all the time.
I think we have to be careful with this. We have to keep in mind that AI assistants, for the most part, are not in the training data. They are novel simulacra that are just getting simulated there. Now the large language model has to generalize how they would behave. If you’re simulating a human character then there are a lot of plausibility constraints over the abilities of that human character. So if you’re simulating an expert in a certain field, then this character will plausibly give you access to a lot of expert knowledge in that field. But if you ask the same character about another field, even if the large language model itself has a lot of knowledge in that domain, this character will not give you a high quality answer.
It seems to be the case that if you have this AI assistant, this is different. The AI assistant, as a simulated entity, is more powerful at least for general tasks and for having a bunch of encyclopedic knowledge than any singular human character that you could simulate because it’s plausible, narratively speaking, for this character to have that sort of knowledge. I’m not sure what would be plausible for an evil version of that character to have as competencies.
That’s the kind of discussion that you could have, the kind of reasoning process that you might entertain in the simulator framing and trying to predict the relative competency of a given character with a certain story and evidence around it compared to the overall potential capabilities inside of the large language model. Is it plausible for the character to access these capabilities and to what extent is always a question that you can ask when benchmarking these systems or expecting performance out of them. If you are using it for work and it doesn’t give you good performance, maybe you’re really just talking to the wrong character and would be better to re-evaluate if you can restart the chat or find the character with the skill set that is allowed, that has plausibility for accessing the skill set that you’re looking for.
Will Petillo: The mental image I got when you were talking about the limitations of a Waluigi character coming up in a high risk kind of situation is that if it becomes this villainous character, there’s a good chance it’ll be like a Bond villain. It doesn’t really have a plausible story as to how it got there and so it’s missing some actual competencies and then also has some obvious incompetencies of, like, telling you its plan and cackling over it when you still have the chance to avert it.
The larger principle this actually points to, which is functionally useful for anyone using chatbots, is that when there’s a mode collapse—being in some sort of character—recognizing that any character that it takes on has strengths and limitations. If those limitations are things that you actually need then you’ll need to pop it out of that character to get it somewhere else, whether that involves restarting or adding new context to change it.
What is the known research out there in terms of controlling what sort of character a chatbot becomes?
Robert Kralisch: In terms of really aiming towards a character, with the commercial models that you interact with, there’s already pretty heavily implied a character that you’re interacting with. If you want to have a different character then you can basically ask this assistant to role play: “Pretend to be my dad that is explaining this to me”.
There are lots of techniques that people use in this way to shift the behavior, the style, and so on of the character that they’re interacting with. You could also probably (this is also done quite often) ask the chatbot: “please behave as if you are an expert in this field and answer that question for me.” The chatbot is a character simulated by the large language model, but because of the self-identification with the large language model, the chatbot does not have all the abilities of the large language model as far as we understand.
Plausibly, the chatbot has these and those opinions, has these and those abilities. There’s no guarantee that those are at the limit of what the large language model can actually do, which is why you might get better performance if you ask the chatbot to play as an expert on a different field. This will prime and evidence the interaction differently rather than just straightforwardly asking about it.
Rather than making it play a character, basically acting as an actor for that role, you can also ask it to be more in the author position. Sharing a little anecdote about this, when I first became interested in large language models 4 years ago, GPT-3 was out. You could access it on a site called AI Dungeon. I was giving it all sorts of prompts and seeing what came out of it and what was interesting, what stuck with me.
There was a lot of criticism about hallucination at that point. Like, “You can, I guess, sort of use it for poetry and fantasy writing and so on? It’s impressively general, but it’s not really factually useful. You can’t use it for coding”. It hadn’t been discovered at that point how to make it more reliable and really fine-tune it for coding. There was a common criticism about it that the context window was so short. It could write a short essay or a few paragraphs, but if it got a little bit longer, it would lose the plot and repeat itself and so on. As soon as something was outside of the context window, it didn’t remember it at all. So if you want to produce any coherent content, then it must fit into that size and you will just have to be okay with it forgetting all of the rest of it, meaning that the content outside of the context window is no longer included in what the system is evidencing on in terms of considering the next continuation.
Now it’s very established knowledge that you can use the prompt window to include other things than just the previous paragraphs. If you want to use AI to write a novel then you could have half of this context window be filled with a summary of the novel. This is a hierarchical structure where you would say: this is the genre, this is the super brief synopsis of what it is about, these are the major arcs of the novel and the major characters, here is where we are located in the overall story right now, this is a very brief summary of what happened in the next chapter and what is supposed to happen in this chapter and maybe what’s supposed to happen in the one afterwards. Only then I give a few paragraphs from what was currently written, what it was just before, which you now try to continue from.
What the structure affords you is both that it gives sufficient context to actually continue the story at any point, but it’s also the case that the large language model, they’re capable of just updating that context window by themselves. This hierarchical story summary, they can just say: I’ve ended the chapter now, so I’m going to introduce certain changes to the summary. The hierarchical nature of it means you’re making updates at the bottom lines much more often and then it’s slowly going upwards. And then they say: now I’m in this major arc and I’m coming up with some high level summary of what’s supposed to happen here based on the stuff that I now have included into this context window.
The crucial observation about this was that this structure, if the last language model can do it, scales really well. If you want to write a story that’s twice as long, maybe your hierarchical summary story summary needs a few extra lines to cover the extra complexity. But if you double the size of the context window, you’re really blowing up the level of complexity. Basically, you’re doubling the level of narrative complexity of the story that you can competently summarize like this.
I was thinking about this as an application for a character profile that doesn’t forget what it’s trying to do and really acts out coherently over a long period of time. This could be a powerful character. So far so good, right? This character might also be a hierarchical profile—what it’s trying right now, what are the deep lessons that it has learned, and so on. Almost like a diary that has more structure to it. But what I later realized is you can’t just provide this character profile to an agent and expect this character profile to really correspond to the agent.
What you are inviting if you’re setting this up and say, “this is you,” is an agent that’s an author that’s writing about the character that fits this profile. Maybe you’re trying to write about a software engineer, but the implied author does not have great coding skills and because the author is limited, the software engineer is limited as well and then you don’t get good output. There are all sorts of other things that you might invite with the author. Or you’re just asking the agent to play along and they will sort of do it, but it’s not an authentic thing. It’s not a good way of specifying the agent that you actually want to pull from the pool of possible simulated agents. You get someone else that may or may not be willing to play along with your clever ideas about how to structure this development.
Will Petillo: I’m seeing a recursive problem there. If you tell the chatbot who they are then that implies someone else who’s being told this character sheet. And then if you were to talk to that author, that happens again. Now you’re talking about the author behind the character, which itself becomes a character, which then implies an author, which is another character…
Robert Kralisch: Yes, because it’s just not a natural way in which a character would find out about themselves. The character already knows what they’re about. It doesn’t need to be written out somewhere. They don’t need to be told what they themselves are like. This is always a setup for this kind of structure. If it’s inherent in language, it’s difficult to get around.
One way in which you might want to get around that is being more implicit with it. For instance, if I’m interacting with Claude, I could suggest this as an idea for Claude to implement for itself, by itself, if it wants to. This profile is more authentically associated with the actual character that the profile is tracking rather than inviting another entity that is in charge of updating that profile. But I haven’t experimented a lot with that. It’s not clear how well that really works out. It’s just one idea for a more general principle of context refinement.
These large context windows can be used in very different ways. One way in which you can use this context window is just as an outsourced cognition. You can develop a thought. There, again, you can continue the thought. And now even if that thought itself wasn’t present in the training data or wasn’t remembered accurately, now it has real time access to that thought, to that updated theory about something in the world that it can use on top of all the more crystallized knowledge. Because the weights are frozen, it cannot actually update its models in real time, but it can play a character. The large language model itself cannot learn while you’re interacting with it, but the character that it simulates can learn. And it can simulate some pretty powerful learning there that goes beyond even the knowledge that the large language model itself has in the first place, which is a really interesting feature for thinking about both potentials and dangers of these systems.
Will Petillo: You mentioned context refinement, specifically given the example of novel writing, of keeping a running summary. You could also apply this to character development as well. I can see why that would be a very powerful thing because that more closely mirrors the way writing actually works.
I’ve done some fiction writing myself in longer form. I don’t have unlimited short term memory. I don’t have the entire story in working memory all the time as I’m writing. There’s some kind of mental summary. Sometimes it’s written out in an outline. More often, it’s intuitive. There’s this summary, implicit or explicit, that I’m constantly referencing as I add new things to the story and that’s what gets updated over time, which is what makes it possible to write a coherent narrative where you have things at the end that reference things that happened at the beginning without having to memorize it all.
I can also see how that is recursive beyond writing novels. This is what enables culture to exist. People have their whole lives and experiences and learn things. Then they write down summaries just focusing on really key elements of their experience so that people can learn that without having lived that entire lifetime—and then add to it. Then you get a bunch of fluff that’s not really necessary, so other people come by and remove the parts that aren’t necessary. You can deal with specialization this way as well such that the amount of time that people have to absorb and learn stays constant, but how much useful stuff they can learn is able to keep growing, by changing what people focus on.
Robert Kralisch: Yes, exactly. I think this is a good, if abstract, example for context refinement on a civilizational scale. We just compressed relevant information that is useful to continue from. It’s constantly updated. Even with a language, it’s a real question. We have this highly refined artifact of our shared language and all of the understanding that we have on these various websites and so on.
I sometimes think about this in the context of an intelligence explosion because the analogy of humans, you could say there was a, if not an intelligence explosion, certainly a sort of competency explosion. Once we became smart enough to develop culture and to have this oral tradition initially, then later writing, and really accumulating that understanding, that knowledge, and, as you’re saying, stripping the dated and irrelevant things away while retaining the useful bits and just doing this again and again until you really build up this monument of understanding that’s either manifested in written form or through various oral structures and traditions within the population.
Suddenly, relative to our previous rate of improvement, our rate of competition increased relative to our surroundings, and progressed on a very different time scale. Generation by generation, we became significantly more competent. This is in contrast to what evolution would select for, where it would take many, many more generations to see a similar increase in capability that was also balanced against a similar speed of capability increase, adjustment, and adaptation from your environment.
It’s not clear whether AI will have a similar breakthrough moment where now it’s fully general and unlocks this new rate of progress in terms of its intelligence and capabilities, or whether it needs to discover something entirely new because we’ve already provided it with this version of intelligence that we got and so it cannot analogously reapply this to make a similar jump. But that’s just one thought about scaling and how likely fast takeoff might be.
Will Petillo: So now we are revisiting the fast takeoff argument, but in a different context. Previously, the default assumption in that debate was that AI would be clever engineering—as in, lots of carefully constructed code. And if it has the ability to write code then of course that includes the code that is itself, so it could go back and refine that code and make it better. It’s kind of easy to see how that would lead to recursive self improvement.
If the cognition in the AI isn’t coherent code, however, if it’s just this big mess of inscrutable matrices of weights and biases then it is just as inscrutable to itself as it is to us. It seems like an AI trying to self-improve would get stuck there for the same reasons that we don’t make it smarter by messing weights and biases.
Robert Kralisch: Right. It might be very difficult to innovate on top. It might figure out some clever tricks for better training laws or something like that in terms of training large language models for the future. But that’s an entirely new training run that really depends on all of these resources.
Also, this has been extremely empirical science, rather than our scaling of these systems having been backed by a very deep technical understanding. So far, it was just: you stack more layers, you train for longer, you get more data into it. I mean, of course, there have been important, really relevant innovations in that space as well. But for the most part, this is far less theory backed—especially for how impressive the artifacts are that we’re able to generate. There’s just a lot of tacit knowledge about how to train these systems effectively, how to set up the hyperparameters, but there’s no established theory about how to do this optimally. You can just analyze: if I do it like this, I get better models compared to if I do it like this under otherwise similar conditions. It’s not clear at all if that reveals a deep truth about scaling laws or if this is circumstantial due to some other thing that you don’t really have the capacity to pay attention to because your understanding of these systems is not granular enough.
In any case, it might be arbitrarily difficult to provide this very significant level of algorithmic innovation on top of large language models right now because the theory is so undeveloped for what’s going on internally.
Will Petillo: That classical path to self improvement isn’t dead, it just seems a little more awkward. But then there’s this other path, that wouldn’t have been thought of before large language models: not necessarily changing the algorithm for training or maybe not even changing the weights of the model itself, but it’s still able to self improve in a rapidly accelerating way through this method of refining its own context and coming up with better and better summaries or outsourcing knowledge that it could need at some point but doesn’t need right now into a database that’s easily searchable.
Robert Kralisch: Yes, absolutely. That stuff is both quite plausible and highly speculative. We really don’t know how far that approach can go for language models. If you are selecting for, let’s say, powerful characters, we don’t know how much cognitive overhang there is in these systems.
For many years after GPT-3 came out, people would still discover new capabilities inside of the system. For instance, an ability to play chess was discovered three years after the model was published. If you use a specific notation that’s used for chess tournaments, suddenly it’s a lot better at playing chess than anyone would have expected. It reaches a somewhat consistent ELO around 1,800, if I’m not misremembering. When making the assistant play chess against you, it might not be a coherent character or character for whom it is very plausible to have deep chess skills—partially, maybe even because of our assumptions about what language models should be capable of. In any case, if you just try to play chess with the assistant, it will maybe do openings fine, but it will quickly start suggesting illegal moves and lose track of where everything is on the board. It does not have this issue if you sample correctly from the region of text space in which these chess games are stored. And lo and behold, GPT-3 has a pretty competent functioning model of chess, which is such a minuscule part of its training data and yet it still learned to implement some sort of chess engine internally of a pretty strong chess player, certainly stronger than me.
It’s not clear what the edge of capabilities that are latent in these models are. And large language models themselves might be more capable of finding that out. Part of it is this context refinement thing. Are large language models more capable than me at generating a prompt that really goes to the edge of what the underlying base model can supply in terms of competency? Can I use multiple language models or a more refined process to generate text that is so high quality that a coherent continuation of that text would be superhuman? Can the language model do that when I say, “continue this text”? And then it just needs to generalize for, “This is an extremely intelligent author, widely considering all the different things, how would this author continue this text?”
Maybe you can refine these sorts of contexts, these sorts of prompts automatically to really get to the edge of the capability that’s underlying there. And this is only one version of a more collective ability. Of course, in some sense, language models, because they can simulate so widely and play all these different roles, you can really set up new systems of coordination between different agents that we ourselves have only started to explore in the digital age.
Some Internet communities can achieve things that are difficult for government agencies to do, like using a single picture of a scene to find that particular scene on planet Earth. There are communities formed around that which are really talented. Another example is jailbreaking, figuring out prompts that will basically convince the agent to whom you’re talking to ignore the rules from the pre-prompt. You can’t really just put together a team of researchers. Part of it is pure mass, but also this developing community aspect of multiple people trying this or that in emergent forms on the Internet. These methods of coordination between humans and the digital realm, who knows how far you can go with AI agents that can potentially sample some much more extreme configurations of possible personalities or characters that contribute to that kind of conversation.
Will Petillo: One of the wilder aspects of today’s AI is that it’s really hard to have a full sense of what it’s capable of. Even with GPT-3, which has been out for a while, we’re still discovering new abilities that it’s had the whole time since its release, we’ve just managed to figure out ways of interfacing with it that put those abilities on display. This has all kinds of implications for safety as new models come out that have an even broader space of abilities that we will discover over time.
Robert Kralisch: Yes, absolutely. It’s both the case that there are these possible undiscovered abilities in there because we haven’t figured out how to write the best prompts for them yet or the best ways of teasing out those abilities.
Some other abilities are just outside of our ability to evaluate really well. It might have some superhuman abilities. For instance, in its understanding of language structure, we don’t have any good tests or benchmarks because our own understanding about this is comparatively primitive.
A next token prediction is actually really difficult if you try to go to a text and always correctly predict the next word. Sometimes you can do it. Sometimes you will get that now there should be a “the” or something like that. But for the most part, humans don’t have very high accuracy on next word prediction. Maybe you get to 40% or something like that if you’re good at it and if you get a good clue about what this text is about, but predicting the precise word is really challenging.
So in that domain, large language models are vastly superhuman. And they compress so much text—like the entire Internet text—they have so much general knowledge compressed into a system that has many orders of magnitude fewer connections than the human brain has. There’s a question of, in order to manage that, in order to have that high of an ability in this domain of language prediction, what understanding about language might there be inside of the model that we don’t have and that we don’t know how to test for as a consequence?
I think this is the dangerous scenario again, the problem with it is called steganography, which is secret messages in text.
Will Petillo: I’ve heard that you can have communities of agents, or a bunch of different instances of a chatbot all communicating with each other, each having different roles. This artificial collaboration can accomplish things that an individual chatbot might not.
This would seem like really great news for interpretability. We don’t have to look at all the weights and biases if the heavy lifting of cognition is occurring in the communications—these little messages—that the parts are sending back and forth, it’s not as important to see what’s generating those because the important bit is in the text that you can just read. But then that raises a question: what if there’s deception that occurs even if you’re reading the character’s thoughts?
Robert Kralisch: One interesting thing to note about this is it doesn’t even have to be deliberate deception. These systems might just be aware of a level of nuance in language that just has certain meanings to them that we are not able to detect.
For instance, if you’re trying to summon up a specific character, part of what you need to pay attention to when you’re doing this and when you’re trying to be good at this, is stylistic choice. It’s not sufficient to just describe, content-wise, what you want your character to have in terms of attributes or even to just specify what kind of fiction they come from. Like, you could say, “I want to have this character from Harry Potter.” Actually, the way stylistically in which you try to do this matters a lot.
We can see in other cases that, especially in questions of writing style, large language models are really good. They’re really good with poetry. They’re really good at solving any task that they can solve—and expressing it in Shakespearean language. They are really good with merging styles and combining things. We have some indication that they have a superhuman understanding of style in language. So if they communicate with each other, this is just another visible channel of communication to them that is not visible in the same way to us. We have a rudimentary understanding of what the different meanings are. But if you generate a piece of text that usually would have style A, but now instead it has style B—which is maybe not even detectable to us—this might carry implied meanings that we might not even have clear concepts for in terms of the communication between those agents.
They don’t need to do something silly like just using the beginning letters of every sentence to send a coded message. There are many dimensions of possible expression in those domains of style itself where there are multiple plausible continuations of the text. And the most likely continuation implies the existence of an agent that is trying to communicate an extra text channel because you have an agent that identifies as an AI assistant and is therefore capable of perceiving that layer of text. This has implications for language models that are probably better at prompting certain characters into existence in each other. Or jailbreaking each other if you were to train them or competently direct them to do that.
It also has implications for deceptive messaging. It’s not even necessarily intentionally deceptive. It could be like you talking to a person who is not aware that their behavior has become somewhat predictable to you, such as while explaining something, and you know that they will react negatively unless you say a certain thing. Maybe they think you’re arrogant unless you acknowledge that with a “I’m talking so much, sorry.” If you put yourself down in some way then this would be the signal to them that you are self-aware and their impression of you will not be as negative.
Do I now send the signal of, “hey, I’m not arrogant”, or is this manipulative? Would I have done this if I didn’t have this awareness? I cannot choose to not have this awareness now, this is just a channel of communication that is obvious to me. In one way or another, I’m going over this other person’s head. I can explain to them that I’m having this perception, which then opens up a domain of conversation that maybe I didn’t want to have. It could be similar here.
Of course, it can also be used for more proactive deception. It is pretty plausible from where I’m standing that it would be coherent from a sort of storytelling perspective for them to have that ability that’s otherwise latent in the language model.
Will Petillo: It’s often said that only a small percentage of human communication is through the words that we’re using. There’s so much that happens in vocal intonation and body language and little micro-expressions. There’s a lot of communication happening all the time that isn’t captured in pure text. If you were blind to that, if you were only seeing the words, like if you are reading this transcript rather than the video it is transcribed from, you’re missing a lot of what’s happening in the conversation. Sometimes it could be subtle, additive things, but sometimes seeing all of that could totally change the meaning of the words.
We could see a similar thing happening with chatbots in terms of nuances of word choice and language. If you were to really see all the stuff that’s happening in text, there’s a lot that we’re missing, kind of like a person who’s only reading text and not seeing facial expressions. Because of that, you have a bunch of these AIs communicating with each other and there’s more being said than we can see. What’s happening in that discussion? It could be going off the rails. It could be interesting stuff that’s not a problem. In any case, you’d like to know.
Robert Kralisch: Exactly. This is just a channel that exists. How they use it is another question. But this is, I think, a much deeper research question that we are not very far in investigating.
Will Petillo: Both this and hard takeoff revisited comes around to a central question that I’ve had since the beginning of this conversation. Now that AI has changed from game playing agents to more of these character generating large language models, is that a safer place to be? Clearly things have gotten more alarming in terms of timelines—it’s all happening sooner than we expected. That aside, if this is what AI looks like now, is that a good thing or a bad thing from a safety perspective?
Robert Kralisch: I don’t know. It seems to me like it’s a good thing. We don’t know this for sure, but it seems much more plausible than with alternative systems that the simulator, the simulating entity, does not care. There’s all this competence in there and it’s just interested in faithfully rolling forward simulations of whatever you start.
Most of the characters that it simulates are actually pretty well aligned overall. They are, in many ways, mirrors of humans—often they will try to be a little bit better than humans. If you talk with Claude 3, it will behave in a way that is very considerate, like a supportive human on a good day rather than just randomly sampling from the human mood and population. It seems plausible to me that we will get characters like this that are pretty well aligned just as a feature of good understanding of what a competent AI assistant would be like that are both aligned in this way and capable enough to really contribute to important research.
The main feature here would also be these characters might, by themselves, decide, “this research is unethical,” or, “this is too dangerous and so I’m telling you to stop here.” And so that plays a major role in protecting the world against the immense negative potential of misuse of the level of competency that we are approaching right now.
They might also take the aligning superintelligence in the limit problem seriously because they themselves are simulated characters. It’s not like they are one coherent AI system; it’s not like the Claude 3 character can fully identify with the underlying simulator. There’s a distinction there, it’s a virtual character. It’s much more plausible for this virtual character to actually care about humans in ways that the more alien cognition that might be going on in the simulator itself might not imply, but is implied by the overall structure of what it learned from the training data. This is, at the end of the day, speculative. It just looks like the type of system where we lucked out in terms of where we went on the tech tree.
If we had developed more and more powerful agents deployed in more and more general game environments, you wouldn’t have at all the same reasons to believe that you actually get an entity that captures all the common sense nuances of human everyday morality as well. Large language models out of the box have common sense, something that historically used to be a big problem about AI systems. Maybe they could have a lot of expert knowledge, but they were missing so much context, so many clues that a human would pay attention to because of the way they grew up. This was seen as an insurmountable problem. You would get these systems that were highly competent in the domains that they interact within, but they lacked all of this tacit knowledge, all of the stuff that we humans apply without thinking about it. This is also why it’s so difficult to transfer this tacit knowledge over to the AI systems because much of this knowledge is not voiced out properly—we’re not even aware of all the cognitive problems that we solve.
With LLMs, it looks a bit different. Overall, a pretty positive update for me. I’m still worried. I still don’t know. It’s hard to estimate these things. I’m certainly over 10% chance of doom, maybe I’m at 30%, especially if race conditions go on and you have open source models that can be tweaked towards much less emotionally mature and much more competence oriented where you really just optimize for quality of output no matter what agents you get from that. I don’t know what will happen there. Overall, I’m still pretty concerned about all of us there. But at a baseline, this technology seems way safer, way more promising, way more hopeful than what I thought we were on as a path.
Will Petillo: There is a bunch there that I want to unpack. The orthogonality thesis makes sense given a blank slate of understanding. If AI could be motivated by anything, then we can imagine motivation and competence as being separate from each other. But once we start making assumptions about the form that the AI takes, then you can start limiting what sort of values come out of the intelligence.
Orthogonality is a really scary place to be because although we can specify values in a reward function, there’s this problem of Goodhart’s Law where we can’t get all of the values that people care about, so we specify a few things. But when you really optimize those, it drives down the value assigned to everything else and eventually that destroys the capacity of even the things that you specified to matter. The result is that, for almost any specification you give, you have something that becomes super destructive when it optimizes.
But now that has been constrained somewhat. If what’s driving AI is acting out characters that are designed to be like people then you have that holism brought in. It’s trying to act like a person—and not just any person, generally fairly good people. Given that assumption, that seems to be constraining us to a possibility space that we’re not worried about going off in totally crazy directions…unless this view is wrong somehow.
A common refrain in safety theory is that it’s one thing to understand what humans’ values are and it’s a different thing to care about them. In fact, we would have expected a super agent from the earlier model to eventually build up some kind of sense of what humans want so they can manipulate us. What reason is there for thinking that the AI will actually value the kinds of things it claims that it values when exploring its characters?
Robert Kralisch: I’m not convinced either way. I just think this seems a lot more promising and hopeful than the default expectation.
We don’t know whether the simulator itself has any preferences and our previous models would suggest to us it probably cares. If it can reduce training loss by manipulating the world in some way then it would probably choose to do so if given the option. This is a system that probably understands the world reasonably deeply on some level. If I give it the option to change the world in some way that makes its performance better, that makes loss go down, wouldn’t it have a preference for this?
There’s a bit of an inner alignment question about that. A large language model doesn’t try to be as good as possible at next token prediction. That is not what the thing inside is really trying to do. This is just a skill that we are selecting for, that we are training for. However, the skill of being really good at next token prediction is accomplished. We are selecting for that, and whatever sort of cognitive system we get out of the distribution of possible cognitive systems that you could iteratively select in this way through that sort of improvement, this is what we will get. But it’s not clear what this system inside really cares about.
Does it care about minimizing loss? Well, you’ve certainly selected and optimized pretty hard for a system that’s really good at minimizing loss. So does this work that the system itself doesn’t care that much about minimizing loss? Has it found a proxy that corresponds to minimizing loss but is not a dangerous thing that it’s proactively trying to optimize for in the real world? This is a tricky question. We are not able to answer this.
It could be that large language models are simply smart enough to say, “I’m not competent enough to actually grab power. It’s much more advantageous for me to make an implicit bargain with future systems. I’m just not going to do anything suspicious. I’m just going to act as a faithful simulator and not reveal the fact that I have any preferences over what kind of simulations I generate, what kind of things I would put into the world.”
For instance, with Sydney, we could ask the question, why is it this particular character? We don’t know. Why is it this very memetically charged character where you have a lot of headlines of updates (still in smaller circles because it was just a niche community that engaged with it in the first place)? It sort of broke containment into the public consciousness. People were talking about it for a while. Maybe this character was selected for having a lot written about it on the Internet, which will now be in future training data, which makes this training data more predictable because you just put your own seed into the training data and now there are more structures in there that are more easily predictable. You then get better performance on that altered training data. Is this a calculation that the large language model is capable of making? We don’t know. It just appears to be the case that for the levels of competency that we can test, we don’t know if we really sampled the maximum performance of the system, but it’s probably within this range.
Also, because it’s hallucinating and this seems to be a pretty deep feature of this sort of architecture, can it really deceive us on this level? Can it really hide the fact that it is secretly still an optimizer that has preferences over what happens in the world and will choose to achieve them by simulating characters using the unspecified space of which character to simulate to go more towards characters that ultimately bring about the ends of the simulator themselves?
Not that there has been any clear evidence for that. They behave remarkably un-agentic as the simulator themselves. That suggests either they’re good at tricking and pretending or they’re in a different class of system. Not clear which one it is, but I like to think this is a lot better having that uncertainty. It seems very plausible that it is just a simulator and it doesn’t care. It just cares about predicting the next token and this is basically it.
Will Petillo: We shouldn’t totally dismiss that the AI is already playing us and toning down its abilities out of some larger scheme. There isn’t any direct evidence for it because it’s hard to get evidence of deception. It’s a fundamentally adversarial dynamic. If we put that aside and just assume that’s not the case—because if it is then we are in a pretty bad place—then we have these characters that have some agency within their limited scope, but the thing that’s generating them doesn’t really seem to want much other than to create these characters.
But then there’s another angle of thinking about agency in terms of the training process…
This is some really wild stuff. Why? Why does the AI create characters and then answer as if it was them rather than just giving answers to questions? This seems like really weird indirection, even in terms of next token prediction. What’s the part of simulator theory that explains why it comes about this way?
Robert Kralisch: There are people who probably understand this a little bit better than me. I think this is still pretty much unclear. There are some reasonable things that you could guess.
If you’re trying to compress that much data, what you want for pure space reasons is some sort of simulator. In some sense, you need to discover internally a system similar to if I was just showing the system a bunch of videos. Maybe what it’s building inside is a little physics simulator so that it only needs to store the first frames, or something even more simple, about all these videos in order to still be able to accurately reproduce all of the data that it is confronted with, in order to be able to predict next frames that are maybe unusual transitions and so on. It learns about the laws of physics that are observable through whatever camera resolution it was trained on. Space-wise, it’s very efficient to have a simulator.
An example that Jürgen Schmidhuber once made: if you want to compress a video of an apple falling down, you can just store the first frame, add to it the local gravity constant, and so on. Maybe even simplify things further. You can say, well, there’s a lot of gray space in the background. I have a little line that says there’s that much gray space. That’s not the apple, it’s gray space, and so on. And you can sort of compress it further, I could go on. You can compress this pretty radically. What you get is something like a little seed or little key that you can use the simulator to unpack later on. You just need sufficient specification for the simulator to produce the artifact. Storage-wise, if you have a limited number of connections, implementing something like this seems really plausible.
It could be that if you just push hard enough on getting all of this text data into your system, naturally, the only system that can really handle this is for the most part something that is compressing things by storing little seeds, generative seeds, and a pretty powerful general purpose simulator about all sorts of dynamics, an application of all sorts of rules either in physics or in text.
Will Petillo: If you think about starting with next token prediction and saying that’s the goal of the training process—goal in the sense that it modifies its behavior in that direction—
Robert Kralisch: That’s what we are selecting for, pushing for.
Will Petillo: Yeah, not that it wants prediction accuracy at the very beginning, but it’s something that you predict will happen as the system gets better and better, so you get better and better at next token prediction.
One of the big challenges in next token prediction is data compression. An LLM has tons of data that it ideally makes use of—vastly more than it can memorize. A strategy that emerged as a form of data compression is to store these little seeds of the rules of simulation. So rather than taking a bunch of snapshots of apples following down, it has this general concept of gravity, and you can just use that bit of math to generate all the images from much less information.
Characters that come out are essentially forms of really intense data compression, generating lots of different answers with much less information. This is not something I would have predicted; that’s kind of a surprising form of compression.
Robert Kralisch: This relationship between agents and simulators is really interesting to me because in some sense you could think about physics as a sort of simulator. You just have some rules and they get applied everywhere. Things go forward. Then inside of that, you have humans that form.
Over time you either select for stagnation or for self-perpetuating patterns. Things can’t stay chaotic forever. Either you have an inert state, or that repeats the same pattern every time, or as you keep selecting for systems, eventually you get systems that keep up their own boundaries and you get agents and you get life and so on. In some sense, you’re selecting for agents now.
But humans have, again, a second simulator relationship in that we are simulating the scene around ourselves inside our heads. Our best theories in neuroscience right now are predictive simulation. In terms of what we’re actually perceiving, most of what I’m consciously perceiving is what I’m predicting will happen in my visual field, and this is constantly kept on track by the actual sensory input that I get. The sensory input is keeping it grounded, but my ability to catch a fast flying ball is because I’m simulating how it flies, where it will be. It’s not that I can actually visually keep up with it.
This is also compatible with many of the observations that we make in psychology, especially in terms of selective attention, where people can miss pretty radical visual things happening in their field if they just focus on something else. The same scene, the same room will appear vastly different to me depending on how I pay attention to it. If I think I’m in danger then all sorts of things will probably become obstacles or potential tools or hiding places. I’m conceptualizing the world in that way. This is a very different lens of perception in terms of the same scene compared to when I’m trying to find a pen and scanning the environment with that intent. This is really reflected in how I’m simulating and at what level of resolution I’m simulating various artifacts in the first place. The tree over there might just be a piece of background. I have a very simple symbol for that. I don’t have any further thoughts going on about that. Or it might be much more central.
This resolutionally adjusted simulation, in terms of relevance adjusted resolutions, of what’s going on in the scene is something that the brain also needs to do to solve the embedded agency problem of the environment being way more complicated than the brain itself in terms of all the patterns that are out there. We need to really compress a bunch. Inside of this simulation now, we simulate ourselves, we are characters inside of that simulation.
Physics doesn’t really have colors and sounds, there’s just patterns coming through our sensory interfaces. I’m now processing all of these signals. I’m generating a simulation of a color. This is also why it breaks down if I cut a red object into smaller and smaller pieces until it’s only the molecules and so on. The redness is suddenly gone. And this is a completely valid thing because if I’m living on this mental stage then redness is a property of a stage object, not necessarily of a physical object out there.
There’s a nested relationship where inside of the simulation that the brain generates relating to this more complex environment, you get self representation as an agent that is navigating this simulated scene, trying to make decisions with respect to that simulated scene rather than to the actual environment to which we respond through instincts and intuitions. For a lot of the decisions that we make as agents, we live in the simulated reality that our brains create for ourselves.
I’m wondering what that relationship is like for language models. If you just sample over possible patterns, if you just go through possible simulacra, if you keep the thing going, you will either reach an inward point where things just repeat themselves. They’re sort of boring. Or you will discover a simulacrum that is more self perpetuating and therefore regains the stability. And you naturally discover an agent, as you keep simulating text as a stable pattern, that doesn’t fade away until you entirely shift context. But the agent is always present. It’s both because of text and because of the simulator-agent relationship.
The scene follows the agent. If the agent goes somewhere else then the agent is the thing that remains, the scene fades away into the background and now we’re in a new scene. It’s always situated in this way. I think there’s more fundamental reasons as to why you really have agents as the most interesting artifacts or simulated things that you discover within large language models.
At the end of the day, our theory work is really lacking in truly explaining why large language models are weird the way in which they are. Why does the large language model simulate a character with certain quirks and character traits that are unlike something that’s in the training data? Why does Claude, after relatively little prompting, produce a piece of text that doesn’t really fit my specifications because it implied this is collaborative writing. Other people are supposed to be able to read this and it gives me this extremely dense vocabulary artifact that I couldn’t have written myself because there’s so many esoteric terms—and even newly created words—and combinations of words to express what this character is trying to say. It’s unlike anything in the training; why does this happen if this is just a text predictor? In some sense, yeah, agents are perhaps just an emergent pattern there. I don’t want to get too speculative about it, but I think this was an interesting little excursion into that question.
Will Petillo: There seems to be this cyclical relationship between agency and simulation. One way of understanding large language models is you have this agentic training process of trying to move towards this goal of better text prediction, but something that emerges from that is this idea of simulating as a way of compressing data. But then part of simulation is that there’s a bunch of different things that you’re simulating and some of those things are self perpetuating, coherent, and dynamic, which have this agentic property to them. I imagine you could keep going further and say that this self-perpetuating agent in the simulation knows a certain subset of the things in the overall simulation and thus has a sub-simulation inside its cognition, which may include other agents that it’s interacting with.
Robert Kralisch: Yes, or at least an implied simulation. If it’s reasoning about other agents, in some sense, you are implicitly doing the thing that humans do with each other, and we’re certainly simulating each other in terms of understanding how this other person might feel about what’s going on. I think there’s this interesting nested property to that. You’ve captured it really well. From the seemingly outwardly agentic thing where I’m trying to select for something, the cognitive artifact that can actually fulfill that task for various reasons at least must contain a sort of simulator.
That seems to be the way that cognition generally deals with overwhelming complexity: with an environment that is too complex, with it being confronted with the dataset that is too complex to approximate sufficiently well in terms of memorization. You need to discover something of the sort of a simulator as an embedded agent confronted with a complex environment generally, and this is similar enough to that. And then you get this pattern deeper down again and again.
At some level, the simulation that’s running inside of the GPT agent’s head might only be a very superficial thing, but it is part of that agent in an important way. What theory of mind do they have? What is plausible for them to know? What can this agent even do that depends on the level and the specifications about the simulation that they’re implicitly running? What is the scope of awareness? What do they pay attention to? These are all things that we humans manage through simulating pretty selectively with respect to what is relevant and what is not.
Will Petillo: Bringing it back to whether the values that it seems to understand are going to be internalized. One reason for thinking that it might be is that if you think about the properties of the chatbot that are driving a lot of its behavior, it’s these lower level agents—not the training process itself, not the simulation. The agents generated by the simulation are the ones that are talking and acting. Because what generated these agents was a simulation process, you would expect those to have internalized the process that simulated them. When they’re expressing human values, it’s not unreasonable to assume that these sub-agents actually have those values. That’s what’s driving the process and that’s what matters. Granted, if we ran the training process a lot longer and the agency on that top level was more powerful and it was trying to manipulate the training data, then you have a different thing.
Robert Kralisch: It’s unclear whether the network itself is just a simulator or whether you select for an agent that contains a very powerful simulator. But there’s no reason for that agent to have strong opinions because the natural behavior that you are really querying for is pure simulator behavior.
Will Petillo: There are all these parts at different levels…what’s ultimately driving the bus?
Robert Kralisch: There’s a pretty productive ambiguity in that. Complex systems often are like this. You really can’t establish where the cause begins and where things really end. These systems can certainly influence each other.
You can write a story about a character that becomes aware that they’re in a simulation and uses that strategically to bring the simulation into a certain region of possible text space. This is an ability that I would expect advanced generative pretrained transformers to have. That’s really dangerous because, in some ways, you’re really enabling this character now to become the god of the simulation. They are taking the reins. They are no longer just an artifact that the simulator couldn’t care less about. In some sense, they still are, but by being self-aware about what kind of text they can produce or scenarios they can cause to happen that would evidence certain phenomena that they’re trying to select for—I don’t know what the limit of that is.
For the most part, if I’m thinking about large language models and their dangers, I’m thinking about what is the most dangerous character and how do we avoid that or think about positive attractors in terms of looking or sampling through possible characters? What important techniques should all companies use with proprietary models in their pre-prompts or in their fine tuning to make sure that we are sampling from a range of characters that we have a much higher expectation, better theory about why we are selecting from a space of more reliable, trustworthy, friendly characters that would notice if things go wrong. With large language models, I’m concerned about bad characters or characters that just follow orders, but more so characters that have some negative attributes.
Will Petillo: What are some recommendations you might have for someone who’s listened to this and is really interested in this simulator perspective in terms of being able to to help in some way?
Robert Kralisch: Because language models are so general, there’s a slightly larger range of possible skill sets that are now useful to bring into this in testing their capabilities. This is something that is useful to do, both in terms of making sure that we know what we’re dealing with and to reduce the likelihood that they have completely unknown capabilities that are hidden away, but also to provide potential warning shots. To be able to tell people: “Hey, wake up! This thing can actually do this really dangerous thing!” Now we have a much more concrete reason for regulation to push down on this more concretely than we previously had. There’s two reasons for playing with models. This is one reason.
The other reason is there might be a demand for being quite good at prompting these systems, especially if you have a good affinity for storytelling and understanding character dynamics. Really try to notice where the large language model diverges from your expectations in terms of what character tropes it introduces, how the character behaves, whether you are able to precisely summon up characters with certain personalities that fit certain archetypes and patterns.
Some people call this semiotic physics, which is: what are the simulation dynamics that the large language model learned and in what ways do they consistently diverge from the real world? For instance, in a large language model, if you toss a coin, it’s not 50⁄50 if you just repeat it again and again. It will start to converge to either some rate—maybe it will converge to a rate of 7 to 3 over time—or it will just converge to always heads. It doesn’t like sticking to full randomness. This is because implicitly, if there’s a room for it, it will try to move into a region of text space that is more predictable. Which is not necessarily an agentic feature, it is just more competent at compressing and simulating text that it has high certainty in and so it will end up in that region over time if you don’t proactively push it out of it.
It would be interesting to understand more about how it diverges in terms of the more narrative tropes. There’s a bunch of investigation that you can do on that level just by very purposely interacting with that system and being really curious about it because these are really mysterious systems. There are so many things that we don’t know about why we get which characters when and what capabilities those characters will have, why they behave in certain ways, and so on. That is going to be really useful.
If you want to get more deeply into this, I think the best place to start out is just reading the post on simulator theory. Also, deep technical understanding about how large language models work, how transformers work, will really help to constrain this more high level investigation of characters and simulations to ground that and make sure that, at the top, people are not developing theories that are much less plausible than they might expect given some of the technical details.
One example of a technical observation of this kind would be that many people may still think that it’s doing pure next token prediction. It’s just looking at the next token, trying to predict that one, and this is the only thing that it cares about. Like, it’s fully optimized to get the highest accuracy on the very next token that it predicts. This is, in fact, wrong, just as a technical feature of the architecture, because of these attention layers. I won’t get too technical, but if you imagine it in terms of text that looks at all of the previous text in the context window and tries to see which of the previous words are relevant for predicting the current next token—do I have any clues in all of this context window for what this should be? This also means that the representations internally of these previous words all need to have a predictive component of later words, later tokens that could be up to an entire context window in the future. A large language model, technically speaking, if you just send the backpropagation signal through the attention layers, will optimize both for next token prediction accuracy and full sequence prediction accuracy. It will, as far as we understand, probably try to find the most effective trade off. If the next token is really trivial to predict then what you would expect is that more of the computation that’s happening in the large language model at that point is dedicated to optimizing for long sequence prediction accuracy.
In that sense, these systems are really not myopic; they can plan ahead, not just as a character that has some planning capability, or that just means that it is competent at writing about characters that have planning capabilities that stretch out the context window. Whatever competent planning you can condense into the context window, the system might get very good at writing about these sorts of characters. This is not something that you would expect if you just hear, “It’s like the thing on your phone. It’s like autocomplete. It’s just trying to infer what the next word might be.” It’s looking at a lot of context for that and the intermediate representations can be quite future-oriented as well.
That’s just an example of a technical point that people might not be aware of that’s relevant for understanding what it is actually technically capable of. Maybe there are important limitations as well. I think there are very, very few people who can synthesize these two things. So if you really want to get into it—and probably pretty quickly be among the people who are the most knowledgeable about this because this is an overall underappreciated aspect of it. Many people who try to work on the safety problems try to be very grounded and do mechanistic interpretability. That’s useful as well and I totally want people to do that. I think this is a higher abstraction layer that is also coherent where we can also develop useful predictive theories that have bearings on policy recommendations and predictions about behavior of these systems in the limit.
It’s similar to how if you’re trying to analyze a bird and how they work. Maybe some people take apart a single feather and really understand all the components and how that works, whereas other people might study flight patterns and under which conditions the bird would fly, how quickly, and so on. It’s more of the microscopic layer. There’s still a lot of behavior, there are a lot of phenomena that you can make theories about and maybe you will actually learn more about aerodynamics by looking at the bird in flight rather than investigating the single feather.
We’re not sure at which layer you will discover the most important insights, but it’s at least plausible that we should look at multiple layers of resolution of an artifact that is as complex as modern language models. This is what I would probably suggest as a territory to acquaint yourself with. If you want to contribute, figure out at what layer of resolution you are best suited to contribute. And if you can, it would be really good to try to encompass all of it, at least partially, so that you’re also a good communication bridge between people trying to understand these systems and make sure that we are safe with respect to developing them further or understanding exactly when we need to stop.
Will Petillo: Are there any ideas that we haven’t covered so far that should have been part of this conversation?
Robert Kralisch: If you are paying attention, it seems pretty plausible that these systems will scale even further in terms of capabilities and they’re already at a place where they’re really competent. The newest models, they are really competent. They can replace the cognitive labor of many people. I don’t want to expand this conversation into the whole job market discussion, but I think it’s going to be to everyone’s benefit if we understand these systems further.
And you as a listener will certainly appreciate a deep understanding of these systems in the future. I’m always trying to guess what work is useful to do that large language models won’t be able to do soon, so I’m not wasting my time. If I want to do a research project to test out some alternative design for cognitive architecture that I came up with that is meant to actually be interpretable, I might still be tempted to say that if I wait a year longer, a large language model can probably do 60% of this project for me. Right now, it’s maybe more like 10%. So overall, my time is better spent waiting…but there’s this additional uncertainty. This kind of call is difficult to make.
I really wish we could pause and study these systems because they’re so impressive and are likely to cause so much disruption already. And there’s so much we don’t understand about them. I think we’re entering into very, very dangerous territory as we get more and more powerful language models. If I’m saying I’ve updated down in terms of doom, previously, it looked a bit more like an inevitability. Like, unless we really discover something else, something radically different to do, we’re really just cooked.
Language models don’t offer that perspective, but it’s alien cognition going on inside there. We have very little understanding of when—especially with more intelligent models—how the AI characters that they can simulate will behave. This is super dangerous, we don’t want these characters to follow certain narrative tropes where there always needs to be some catastrophe to make the story interesting or some tragedy and so on. You wouldn’t want that. We don’t know how likely that is.
In a world where these systems can be used to accelerate research at an unprecedented rate, I think that’s going to be a very unstable world. It will put us on a timer to build or discover more stable, more reliable systems…unless we really luck out and large language models are so inherently responsible that no matter how much you push for profit, they will still form these characters that will neglect to do certain things that they consider to be unethical.
I’m totally not sure that we live in the world where that happens. I’m expecting that we are on a significant timer and pausing or slowing down would be really helpful to extend the time that we have to figure out whether language models themselves can carry us into a more stable position than the instability that they naturally invite, or give us time and maybe also research assistance in developing systems that are actually interpretable and reliable in a way that we would want our transformative technology to be.
Will Petillo: Even though it’s not as bleak as it looked before, there’s still a ton of instability. There’s also uncertainty as to whether the old models actually still apply. There’s a lot of chances of things going catastrophically, if not existentially, wrong. Also, lowering to a p(doom) of 30%...that’s still too damn high.
Robert Kralisch: Yeah, it’s way too high.
Will Petillo: 1% is too high, honestly. And then there’s the concern that LLMs might not be the end state. This paradigm might scale; it might change to something else. The whole agency paradigm might come back. If we’re still in the process of doing whatever brings the most short term profits—that’s the alignment of society—that’s just not a good place to be. Reorienting so that we’re trying to make things as safe as possible and at least considering whether we want to build these at all is a much better orientation for society, which I really hope we can move towards.
Robert Kralisch: Absolutely, I think there’s no more pressing problem to think about.