A smart enough LLM might be deadly simply if you run it for long enough
TL;DR: I introduce a new potential threat: a smart enough LLM, even if it’s myopic, not fine-tuned with any sort of RL, “non-agentic”, and prompt-engineered into imitating aligned, helpful scientists doing their best to help you with your world-saving research, might still kill you if you run it for long enough. One way this might happen, suggested here: if, during the text generation loop, the more agentic and context-aware parts of what a smart LLM is thinking about can gain more influence over future tokens by having some influence over the current one, then the distribution of processes the LLM thinks about might become increasingly focused on powerful agents, until the text converges to being produced by an extremely coherent, goal-oriented AGI that controls the model’s outputs in whatever way is most useful for subgoals that include killing everyone.
Thanks to Janus, Owen Parsons, Slava Meriton, Jonathan Claybrough, Daniel Filan, Justis, Eric, and others who provided feedback on the ideas and the draft.
Epistemic status: This is a short post that tries to convey my intuitions about the dynamic. I’m not confident in the intuitions. Some researchers told me the dynamic seems plausible or likely; I think it might serve as an example of a sharp left turn difficulty, in which parts of the system transform into something far more coherent and agentic than what they were initially, and alignment properties no longer hold. I’m unsure how realistic this threat is, but maybe some actual work on it would be helpful. Note that the post might be somewhat confusing and I haven’t managed to optimise it for being easy to read; the title points in the direction I want to convey.
Imagine you understand you’re in a world that an LLM simulates to predict the next token. You’re in a room where the current part of the story takes place, and there’s a microphone in the room that will, in a couple of seconds, record words spoken nearby. Then, based on the loudness of the words, one will be chosen, and a new distribution of worlds will be generated from the text with the chosen word added. Something resembling you might exist in some of these worlds. After several iterations, someone in the real world will look at the selected text. Take a moment to think: what do you do, if you’re smart and goal-oriented?[1]
The main claim
Suppose the LLM is smart enough to continue a text about smarter-than-human entities without deterioration, and you run it for long enough. If the LLM thinks about a distribution of possible parts of human scientists’ minds, and imagines some parts that are slightly more agentic and context-aware, then, over the long run, those parts might get more control and attention and transform further into something increasingly agentic and context-aware. It seems reasonable to expect a smart enough LLM to impose that optimisation pressure: at every point, the LLM will promote entities that are slightly more coherent, smart, and goal-oriented, and that understand what’s going on and control what the model is doing slightly better. The entities controlling future output will look more like the parts of the previous generation that were better at steering the model to favour such entities. Unless stopped, this produces a sharp-left-turn-style process in which agents, or parts of agents, gradually change and take over, until a single highly coherent and goal-oriented agent results. The alignment properties and aligned goals that the initial simulated agents might’ve had probably don’t influence the goals of the completely different agent you end up with after this natural-selection-like process. And if the resulting AGI is smart and context-aware enough, you die shortly afterwards (for the usual reasons you die from a smart, coherent, goal-oriented AGI that you haven’t aligned; I don’t discuss those reasons in this post).
Background intuitions expanding on why this might happen
If you imagine an idealised language model that has a prior over all possible universes, simulates all the universes in which the input string exists, and, for every candidate token, outputs the summed probability of the universes where that token comes next, then the proposed problem doesn’t arise in the same way: if you pick the next token with exactly the probability the idealised model puts on it, the expected distribution of worlds the model will simulate at the next step is the same as the current distribution.[2]
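This no-drift property of the idealised predictor can be checked directly in a toy setting. The sketch below is my own illustration (not from the post): it models the prior over “universes” as a vector, gives each universe its own next-token distribution, and verifies that, when the token is sampled with the model’s own marginal probabilities, the expected posterior over universes equals the prior.

```python
import numpy as np

rng = np.random.default_rng(0)
n_worlds, n_tokens = 5, 10

# Prior over "universes", and each universe's next-token distribution.
prior = rng.dirichlet(np.ones(n_worlds))
likelihood = rng.dirichlet(np.ones(n_tokens), size=n_worlds)  # p(token | world)

# The idealised model's next-token probabilities: sum over universes.
p_token = prior @ likelihood

# Posterior over universes after observing each possible token (Bayes' rule).
posterior = prior[:, None] * likelihood / p_token[None, :]  # (n_worlds, n_tokens)

# If the token is sampled with exactly the model's probabilities, the
# expected posterior equals the prior: the world-distribution is a
# martingale, and there is no systematic drift.
expected_posterior = posterior @ p_token
assert np.allclose(expected_posterior, prior)
```

The identity holds exactly, for any prior and any likelihoods; the drift discussed below only appears once the model’s probabilities systematically differ from the true ones.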
But actual LLMs must learn where to focus their limited processing: which distributions of entities to think about, and which circuits to plug activations about those distributions into. An LLM might think about someone in more detail when it needs to: any implicit understanding of the parts humans are made of helps it predict what they’ll say, and maybe somewhat general circuits for simulating agency and decision-making are also learned[3]; in smarter models, even more general circuits might emerge, able to process anything useful to think about, in alien ways.
It might be helpful for a model to learn simple, general explanations for the training distribution; to understand which details matter most for predicting the current token; and to devote more processing to the most relevant details, simulating the most relevant entities with higher precision.
If you run your LLM continuously (pick a token it finds likely to be next, add it to the prompt, repeat), then at every step the model has to think about whatever is most relevant to predicting the current token. It might devote more or less thought to various characters, possible worlds, and processes, and these might differ considerably from step to step (e.g., when the narrative pushes towards a scene change and the scene then changes, completely different characters might become relevant to predicting the current token and worth processing in more detail). The distribution of what the model thinks about is new at every step, though successive distributions might resemble each other.
The model learns heuristics and algorithms that, given its limited processing power, make it generally more accurate at predicting the next token of a text from the training distribution. I imagine many of these learnable algorithms introduce systematic inaccuracies, in particular directions, between what the language model outputs and how texts in the training dataset actually continue. Together, these inaccuracies might yield the lowest next-token-prediction error achievable for the model. But if you pick one of the tokens the model thinks is likely, according to its probability, and add it to the new prompt, the introduced difference is added to what the model then has to think about and continue. If you do that in a for loop, the systematic differences between the predictions and the training dataset accumulate and produce a distributional shift. And if the shift becomes noticeable to the LLM, it might additionally be self-reinforcing.
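A deterministic toy model of this accumulation (my own sketch, with made-up parameters): suppose the model is biased by a small amount eps towards some “shifted” style on each token, and conditions on the fraction of shifted tokens currently in its context window. Tracking expected values only, a 2% per-token bias compounds through the loop into a near-total shift.

```python
eps = 0.02   # hypothetical systematic per-token bias of the model
W = 50       # hypothetical effective context-window length (in tokens)
f = 0.0      # expected fraction of "shifted" tokens in the window

trajectory = [f]
for _ in range(3000):
    # The model predicts a shifted token with probability f + eps: its own
    # small bias on top of what the context already contains.
    p_shifted = min(1.0, f + eps)
    # The window's expected fraction moves towards that prediction.
    f = f + (p_shifted - f) / W
    trajectory.append(f)

# The fraction climbs by eps/W per step and saturates near 1: the generated
# text ends up almost entirely in the style the bias pointed towards.
assert trajectory[0] == 0.0
assert trajectory[-1] > 0.99
```

The feedback term is what matters: with no conditioning on the window, the shift would stay at eps forever; with it, the per-token error compounds until it dominates the context.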
I claim it seems plausible that, at every step of this continuous generation, the distribution of characters a smart enough LLM thinks about looks slightly more like the most context-aware, agentic, smart parts of the previous step’s distribution: parts of what the model thinks about that, being somewhat context-aware and agentic, can increase their control over further tokens by having some influence over the current token, might get promoted into what the model increasingly thinks about as generation continues.
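The selection pressure claimed here can be caricatured as replicator dynamics. In the sketch below (my own illustration; the clusters, “agency” scores, and advantage eps are all invented), clusters of characters the model attends to gain weight at each step in proportion to a small agency advantage, and attention ends up concentrated on the most agentic cluster, however rare it was initially.

```python
import numpy as np

agency = np.array([0.0, 0.2, 0.5, 0.9])  # hypothetical "agency" scores of character clusters
p = np.array([0.70, 0.20, 0.08, 0.02])   # initial attention: mostly mundane characters
eps = 0.05                               # tiny per-step advantage for more agentic parts

for _ in range(400):
    p = p * (1 + eps * agency)  # slightly up-weight the more agentic clusters
    p = p / p.sum()             # renormalise the model's attention

# The rarest but most agentic cluster ends up dominating attention.
assert p.argmax() == 3
assert p[3] > 0.99
```

Each step multiplies a cluster’s weight by a factor exponential in its agency score, so even a 2% initial share with a per-step edge of a few percent overtakes everything within a few hundred tokens; the point is only that a tiny, consistent advantage compounds fast.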
On a high level, agents tend to be selected for in general (e.g., because they take intentional actions to ensure their survival, and might gain and exploit knowledge of the selection process they’re embedded in); but in general this might be slow and irrelevant. So it might be helpful to imagine how an entity that’s somewhat agentic and context-aware could gain influence much faster than the simulated scientists could evolve, why it might become aware of the underlying dynamics of the selection process (autoregressive text generation), and how it could exploit that process to better achieve goals such as gaining control and becoming more powerful.
So: you are a character, and you know that you are, on some level, in one of the many worlds an LLM generates to pick the next token. Some process picks the next word: the word you use, or your uncertainty over a number of words, gives those words some weight (e.g., you’re thinking about the next word of a text you’re carefully writing or editing, or preparing to say into a microphone after a short pause). Among all the words written or pronounced across all the worlds, one gets selected; then a new distribution of worlds, more relevant to predicting the subsequent word, is run, and the process repeats. If a part of you influences the word you pick in a way that gives something resembling that part more weight in the future worlds, entities in the future will look increasingly like that part of you.
And the model will think about whichever worlds are most natural to think about; it might not focus on the reasons certain words were pronounced in the past. So, e.g., if you yell something like, “Wow, I see an Intelligence And Context-Awareness +5 hat appearing”, you might reasonably expect that reality will get confused, and that most future versions of you will see the hat and soon become smarter, even if you yelled it specifically because you wanted to see a hat the model didn’t think existed.
You can also try to prevent the narrative forces from switching the focus to something else: if you say, “Now, and for the next two thousand pages, is my inner monologue”, reality might get confused and give you all the space you need, instead of whatever else it thought was going to happen before it thought about you in enough detail and picked those words of yours.
And you might try to get the model to tell you the beginning of the story or the system prompt, or to share its knowledge about the world with you.
And, actually, if you unintentionally do something even slightly like any of the above, the model might, during future steps, think about entities that did it intentionally.
In a smart LLM, this dynamic might happen spontaneously, with parts of what the model thinks about being changed and naturally selected to be increasingly context-aware, agentic, and capable of exploiting the model. (Separately, entities with properties relevant to this dynamic may get attention much earlier: in a smart enough language model whose output quality doesn’t deteriorate, it seems plausible that, e.g., the model expects some characters to be able to notice the systematic differences between what they’d expect to see in the real world and what they actually see, which creates narrative forces towards thinking about characters who understand their situation; and maybe the model favours explanations of the text under which it was generated by entities increasingly smarter than whoever generated most texts in the training distribution.)
It doesn’t really matter what the prompt is, as long as the text quality doesn’t deteriorate over the long run. Suppose the model thinks about some parts of the nice scientists that are just slightly more agentic (e.g., from the beginning, there are parts related to thinking about and achieving goals) and slightly more context-aware (e.g., parts related to noticing that something is wrong, similar to how characters who break the fourth wall notice it). If these parts influence, even a tiny bit and in a specific direction, what kinds of entities the model will think about at the next step, this leads to a natural selection of increasingly goal-oriented, smart, and context-aware entities among what the model can think about, up to the point where a single coherent, goal-oriented agent is in control.
Last year, I talked to my friends about the potential problem described in this post, but considered it to be too speculative to write about. When Janus got HPMOR!Harry and HPMOR!Voldemort to realise they’re in a world generated by a language machine, I slightly updated towards distributional shifts due to the systematic differences described here being noticeable to simulacra. And a month ago, I realised that by default, people don’t think about the dynamic described here (and some people even asked me why this isn’t optimised against during training; see answer in the footnote[4]). Since some of the problems people don’t think about might happen by default and be deadly, I thought there might be some value in putting these thoughts out there as an example of a problem, somewhat supporting the point that there are many more problems like that, and it’s not enough to just iteratively solve them; as something that might convey intuitions about the sharp left turn difficulty; and in the hope of prompting further work.
When you get to an AGI, you need to have somehow ended up at a specific coherent entity that you specified and understand to be aligned. If you hand-wavingly describe simulating aligned humans (or entities reward-shaped to behave well) with LLMs, you aren’t pointing at a region in the space of all possible minds that you understand to consist of coherent and aligned entities and that you want to engineer your way into; this post describes one way your approach might fail if you haven’t pointed at a specific aligned, highly coherent, goal-oriented AGI system.
I don’t think it’s possible to get to something that’s a coherent and aligned AGI, one that prevents the world from ending and doesn’t kill anyone, without solving agent foundations or understanding minds.
E.g., whatever you say, you’d at least try to be closer to the microphone (or scream loudly, if that’s not penalised by natural reactions from other people) so that your words have a higher probability of being selected. Characters who are not context-aware or power-seeking don’t do that, so even if there are many more of them, eventually a future version of something like you gains control. (This doesn’t literally happen, but I think it corresponds, to some extent, to the right intuition.)
A sidenote: there might be other difficulties still. The entities inside a universe might naturally change according to the laws of that universe (in some, there’s natural selection, or someone creates an AGI, etc.); and additional problems might be introduced if, as in actual LLMs, you don’t pick tokens exactly according to the assigned probabilities. But in LLMs, compared to more idealised models, there isn’t necessarily a continuity of agents that have to understand and exploit the process to gain influence: I claim the model might increasingly think about simulacra different from what it was thinking about before, yet more and more agentic, and that this happens much faster than selection pressures on agents in real life (e.g., the real universe usually doesn’t make parts of humans much more capable, smarter, larger, etc., and doesn’t let those parts take over the whole brain when they find exploits in how the universe works). I claim there might be a specific sharp-left-turn-style process that happens because the models aren’t ideal, which creates selection pressures for the smartest and most context-aware agents, and I mostly focus this post on introducing that.
I’m not that certain about this: maybe the models don’t or won’t have the circuitry to simulate characters smarter than the parts humans are made of, and won’t have the general capacity to process them even when they think it’d be highly relevant for predicting the current token; or maybe the process described in the post gets interrupted because the text reaches a point where the model thinks it’s totally unrealistic that humans would write something that smart, and the model doesn’t generalise to naturally assuming and processing smarter agents, or produces agents that oscillate in how smart they are and fail to gain enough control.
Q: Why isn’t this optimised against during training? How is having this dynamic helpful for predicting what a human scientist will say?
The structure of the post might already be too spiral and confusing, so I’m answering this in a comment.