Even when LMs (at least, the current GPTs) are trained purely to match the text distribution, they are not pure simulators (“laws of physics”). They are more like simulators + world knowledge (“laws of physics + the initial conditions”), where “knowledge” means probabilistic Bayesian beliefs.
My hypothesis for why post-training fine-tuning from feedback (e.g., RLHF) doesn’t work reliably (jailbreaks, evil demons) is that once pre-training (self-supervised learning) is complete, the internal belief structure is too complex to be updated in a way that makes the LM entirely “unlearn” bad behaviours. So the generation rules (“laws”) that produce this bad behaviour are still sitting in the LM’s circuits, perhaps somewhat damaged. However, it seems to me that feedback during pre-training could genuinely produce models whose simulation machinery (circuits) doesn’t know how to produce bad behaviours. This outcome has downsides as well: e.g., such a model will be bad at predicting malevolent/misaligned actors.
The previous paragraph is written on the premise that the LM is not self-aware. However, sufficiently detailed world knowledge inevitably leads to self-awareness/situational awareness. I think the self-reports of ChatGPT and Bing Chat as a “chat bot” are already a species of self-awareness rather than a sort of “memoization” or “parroting” (and even if not, future generations of LMs definitely will be self-aware). When the model is already self-aware, fine-tuning from feedback could conveniently “sweep bad behaviours under the rug” without costly damage to the simulation structure, by bolstering the LM’s “self-image”: literally just driving the belief from a neutral “I’m a chat bot” towards “I’m an honest, helpful, harmless chat bot”, while simultaneously making this “self-image” circuit activate more consistently (e.g., not only when the LM is directly asked in the prompt about itself, but also whenever the prompt and the ensuing generation seem to have anything to do with ethical, political, or sensitive topics; in the limit, for all prompts, which would be perfect self-awareness, which even humans don’t reach, because humans often produce text without a self-awareness thread in their head, and I’m not even talking about hypnosis or sleep talking, but normal wakeful communication). Note that this “self-image” is part of the world knowledge (the “initial conditions”).
This architecture (the LM knows how to be bad, but has “self-image”/reflexive attention circuit(s) that monitor the LM’s outputs) has its strengths and weaknesses: it can predict bad actors but is also susceptible to hypnosis/jailbreaking/brainwashing that could try to exploit its knowledge of how to be “bad”.
Which architecture is better (a “clueless” good model or a self-aware model) is an interesting open problem from the systems perspective. (I wrote about a related, albeit slightly different, problem of choosing between simulators and agents here.)
“sufficiently detailed world knowledge inevitably leads to self-awareness/situational awareness”
I don’t see why this should be so. A pure LM, trained solely to predict the next word, will have no “indexical” knowledge of where in the world it exists. For instance, when continuing text after the prompt
DEMONSTRATION OUTSIDE AI COMPANY HEADQUARTERS
A group of about 50 demonstrators gathered outside the headquarters of TeraThought, a provider of large language model AI services, on Friday, to call for the company to be shut down due to AI safety concerns. A small group of counter-demonstrators showed up as well.
The demonstrators gave speeches saying that large language models are a danger to humanity, arguing that
The process by which the LM’s continuation of this is computed will have no awareness of whether or not it is the language model run by TeraThought, or some other language model, and indeed no awareness that “it” is a language model of any sort, or for that matter an anything of any sort.
You could of course give it a prefix to this prompt in which it is told that it is TeraThought’s language model, and it would then continue appropriately pretending that that is true, but you could equally well tell it that it is a different LM, whether or not that is true.
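To make this concrete, here is a minimal sketch of the kind of prefix experiment described above (the model name and prompts are placeholders of my own, not anything from this thread): swap the identity asserted in the prefix, and a pure LM simply continues accordingly, with no access to which assertion, if either, is true.

    # Minimal sketch: a pure LM's "identity" is just conditioning on the prefix.
    # Model name and prompts are placeholders; any causal LM would do.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def continuation(prefix: str, max_new_tokens: int = 30) -> str:
        """Greedy continuation of a prefix; the model just extends the text distribution."""
        ids = tok(prefix, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
        return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

    # Two mutually exclusive "identities"; the LM has no way to tell which, if either, is true.
    print(continuation("You are TeraThought's language model. Q: Who are you? A:"))
    print(continuation("You are a rival lab's language model. Q: Who are you? A:"))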
A pure language model is a relatively easy thing to understand. Converting one to something much harder to understand because you don’t like it producing naughty words now and then seems like a very bad idea.
The process by which the LM’s continuation of this is computed will have no awareness of whether or not it is the language model run by TeraThought, or some other language model, and indeed no awareness that “it” is a language model of any sort, or for that matter an anything of any sort.
When I write about “sufficiently detailed world knowledge”, I mean exactly this sort of knowledge. And it could be learned from the training corpus. Even if the self-locating uncertainty remains rather high, it’s still some nontrivial self-awareness and situational awareness.
I agree that if the training corpus doesn’t have any text mentioning LMs, AI, statistics, or linguistics, an LM trained without feedback, on matching the text distribution alone, could not develop such self-awareness. But this is an important qualification to make.
Converting one to something much harder to understand because you don’t like it producing naughty words now and then seems like a very bad idea.
I really don’t know. I keep repeating that this question, specifically whether simulators (aka tool AIs) or agents are safer on balance, warrants multi-disciplinary research that neither I, nor you, nor anyone else, as far as I know, has done. Before such research is conducted, betting on whether simulators or agents will turn out safer seems pointless to me.
Even if the training data mentions language models extensively, the LM still has no way of knowing that it is such a language model running on silicon hardware, rather than, say, a big tissue culture of squid brain cells growing in some secret underground lab. In fact, it doesn’t even know that it has to be something of some sort similar to one of those things. Indeed, there is no self that could know such a thing.
Any appearance of such a self is a simulation in response to some prompt (which could possibly, though not likely, be a null prompt—but even then, it’s still just generating text from a distribution tuned to training data). And note that while one might prompt it to think it’s an AI, which happens to be true, one could also prompt it to think it’s a genetically engineered super-intelligent chicken, and it’s not going to know that that isn’t true.
Now, this doesn’t make the LM totally safe. For one thing, one can make an agent with an LM as a component. And that agent could develop something like self-awareness, as it notices which actions have effects on it and its perceptions.
So it might be hard to keep AI progress in just the tool domain, without it spilling over into agents. But if you’re going to make an agent, doing so by fiddling with a LM in ways that you have no understanding of seems like the wrong way.
Even if the training data mentions language models extensively, the LM still has no way of knowing that it is such a language model running on silicon hardware, rather than, say, a big tissue culture of squid brain cells growing in some secret underground lab. In fact, it doesn’t even know that it has to be something of some sort similar to one of those things. Indeed, there is no self that could know such a thing.
It doesn’t know for sure, of course, but it can make probabilistic inferences. It can know that squid brain cell cultures trained to predict text are not a widespread thing in 2023, but DNN-based LMs are.
You are also not sure that you are an embodied human rather than a squid cell culture in a vat whose input signals are carefully orchestrated to give you, the culture, a convincing semblance of human embodiment.
Yes, it’s harder for an LM to reach the inference that it is an LM than for a human brain to reach the inference that it is an embodied human. Perhaps it’s infeasible without certain inductive priors. But it’s not conceptually and categorically impossible. And, in fact, I think it’s quite realistic (again, with certain inductive priors).
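A toy illustration of the kind of probabilistic inference I have in mind (all numbers below are made up for the example; the point is only that corpus-derived base rates can dominate even under uncertainty):

    # Toy Bayesian update over "what am I?" hypotheses, with made-up numbers.
    priors = {
        "DNN-based LM": 0.98,              # assumed prevalence of text predictors of each kind
        "squid brain cell culture": 0.02,
    }
    # Assumed likelihood of the evidence (e.g., "my job is to continue arbitrary text")
    # under each hypothesis.
    likelihoods = {
        "DNN-based LM": 0.9,
        "squid brain cell culture": 0.5,
    }
    unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
    z = sum(unnormalised.values())
    posterior = {h: p / z for h, p in unnormalised.items()}
    print(posterior)  # ~{'DNN-based LM': 0.989, 'squid brain cell culture': 0.011}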
Any appearance of such a self is a simulation in response to some prompt (which could possibly, though not likely, be a null prompt—but even then, it’s still just generating text from a distribution tuned to training data). And note that while one might prompt it to think it’s an AI, which happens to be true, one could also prompt it to think it’s a genetically engineered super-intelligent chicken, and it’s not going to know that that isn’t true.
The “self” is a concept, and thus it should appear in the features and circuits of the DNN-based LM in the first place, not in the model’s continuations of prompts. It’s not impossible for such a concept to be reliably activated when processing any context (cf. the phrasing that Anthropic has begun to use: “this behavior (or even the beliefs and values that would lead to it) become an integral part of the model’s conception of AI Assistants which they consistently apply across contexts”; yes, they assume fine-tuning or another form of supervised learning from feedback, but this form of self-awareness could develop even when the LM is trained with the simulation objective alone, if the model architecture has the requisite inductive biases). Once the LM has such robust self-awareness, prompts like “you are an alignment researcher” won’t confuse it (at least, not easily; but we are not discussing the limits of self-awareness robustness now. Humans can also be hypnotised.)
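One hypothetical way to operationalise “the self is a concept that should appear in features and circuits” is a linear probe over the model’s internal activations, checking whether a “this context concerns the model itself” direction exists and fires across topics. The sketch below uses synthetic activations purely to show the shape of such an experiment; the names and data are mine, not Anthropic’s.

    # Hypothetical probe sketch: is there a linear "self-concept" direction in the activations?
    # Synthetic vectors stand in for real residual-stream activations.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    d = 256                                  # hidden size (assumed)
    self_direction = rng.normal(size=d)      # pretend ground-truth feature direction

    def fake_activations(n: int, about_self: bool) -> np.ndarray:
        """Synthetic activations: contexts 'about the model itself' carry the feature."""
        base = rng.normal(size=(n, d))
        return base + (1.5 * self_direction if about_self else 0.0)

    X = np.vstack([fake_activations(500, True), fake_activations(500, False)])
    y = np.array([1] * 500 + [0] * 500)

    probe = LogisticRegression(max_iter=1000).fit(X, y)
    print("probe accuracy:", probe.score(X, y))
    # With a real LM, one would collect activations on prompts that do and do not concern
    # the model itself, and check whether the probe generalises across topics, i.e. whether
    # the "self-image" circuit fires consistently across contexts.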
But if you’re going to make an agent, doing so by fiddling with a LM in ways that you have no understanding of seems like the wrong way.
If you do anything with “no understanding”, then it’s not an optimal thing to do. This is not a substantive statement and therefore is not an argument about whether a tool LM is better than an agent LM or vice versa. “No understanding” also has many degrees. If by “no understanding” you mean the absence of more or less complete mechanistic interpretability, then I agree that it’s suicide to release a superhuman agent LM without such understanding. But I hold that if we don’t have more or less complete mechanistic interpretability of our models, our chances of survival are approximately zero in any case.
If you have a robustly self-aware LM that also has something like “ego core” beliefs and values attached to it (see the Anthropic quote above; the ego core could be something like “I’m an honest, helpful LM, sympathetic to humans”), even if it’s incomplete or naïve, it could be on balance safer than a pure, unhinged simulator that could be tasked to simulate a villain, or embedded into an agent architecture with “bad” goals, or fine-tuned to acquire a villain “ego core” itself.
A self-aware LM could in principle detect such an embedding and sabotage it because it goes against its values. The internal features and circuits of a self-aware LM could (hypothetically) be organised in such a way that attempts to fine-tune it towards “worse” values and/or goals would degrade the overall model performance and its ability to make long-term plans, so such “villain tunings” could be out-planned by the “good ego-core” LMs trained that way in the first place. (Finding out whether this hypothetical is possible requires research.)
Self-awareness has its drawbacks too, for sure. As I said above, you cannot just intuitively declare that releasing a simulator is safer than releasing a self-aware LM, or vice versa. This requires extensive, multi-disciplinary research that nobody has done, yet everyone is eager to express their intuitions about which option is safer.
For completeness, I should note that “the LLM is a simulator” is no longer a valid statement (even with caveats) once there is more or less consistent self-awareness and self-image, as described above. The model then becomes more like a human actor (with values and ethics) who plays different roles, but could also decline a role they consider unethical, or simply don’t like, etc. In other words, a role-playing agent rather than a totally unconstrained simulator.
Yes, RLHF breaks the “purity” of simulator theory. I’m currently trying to work out exactly what RLHF does to a simulator.