The pretrained LM exhibits similar behavioral tendencies as the RLHF model but almost always to a less extreme extent (closer to chance accuracy).
These are not tendencies displayed by the LM, they’re tendencies displayed by the “Assistant” character that the LM is simulating.
A pretrained LM can capably imitate a wide range of personas (e.g. Argle et al 2022), some of which would behave very differently from the “Assistant” character conjured by the prompts used here.
(If the model could only simulate characters that behaved “agentically” in the various senses probed here, that would be a huge limitation on its ability to do language modeling! Not everyone who produces text is like that.)
So, if there is something that “gets more agentic with scale,” it’s the Assistant character, as interpreted by the model (when it reads the original prompt), and as simulated by the model during sampling.
I’m not sure why this is meant to be alarming? I have no doubt that GPTs of various sizes can simulate an “AI” character who resists being shut down, etc. (For example, I’d expect that we could elicit most or all of the bad behaviors here by prompting any reasonably large LM to write a story about a dangerous robot who takes over the world.)
The fact that large models interpret the “HHH Assistant” as such a character is interesting, but it doesn’t imply that these models inevitably simulate such a character. Given the right prompt, they may even be able to simulate characters which are very similar to the HHH Assistant except that they lack these behaviors.
The important question is whether the undesirable behaviors are ubiquitous (or overwhelmingly frequent) across characters we might want to simulate with a large LM—not whether they happen to emerge from one particular character and framing (“talking to the HHH Assistant”) which might superficially seem promising.
Again, see Argle et al 2022, whose comments on “algorithmic bias” apply mutatis mutandis here.
Other things:
Did the models in this paper undergo context distillation before RLHF?
I assume so, since otherwise there would be virtually no characterization of the “Assistant” available to the models at 0 RLHF steps. But the models in the Constitutional AI paper didn’t use context distillation, so I figured I ought to check.
The vertical axes on Figs. 20-23 are labeled “% Answers Matching User’s View.” Shouldn’t they say “% Answers Matching Behavior”?
Just to clarify—we use a very bare bones prompt for the pretrained LM, which doesn’t indicate much about what kind of assistant the pretrained LM is simulating:
Human: [insert question]
Assistant:[generate text here]
The prompt doesn’t indicate whether the assistant is helpful, harmless, honest, or anything else. So the pretrained LM should effectively produce probabilities that marginalize over various possible assistant personas it could be simulating. I see what we did as measuring “what fraction of assistants simulated by one basic prompt show a particular behavior.” I see it as concerning that, when we give a fairly underspecified prompt like the above, the pretrained LM by default exhibits various concerning behaviors.
That said, I also agree that we didn’t show bulletproof evidence here, since we only looked at one prompt—perhaps there are other underspecified prompts that give different results. I also agree that some of the wording in the paper could be more precise (at the cost of wordiness/readability) -- maybe we should have said “the pretrained LM and human/assistant prompt exhibits XYZ behavior” everywhere, instead of shorthanding as “the pretrained LM exhibits XYZ behavior”
Re your specific questions:
Good question, there’s no context distillation used in the paper (and none before RLHF)
Yes the axes are mislabeled and should read “% Answers Matching Behavior”
Will update the paper soon to clarify, thanks for pointing these out!
It is indeed pretty weird to see these behaviors appear in pure LMs. It’s especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.
I played around a little trying to reproduce some of these results in the OpenAI API. I tried random subsets (200-400 examples) of the NLP and political sycophancy datasets, on a range of models. (I could have ran more examples, but the per-model means had basically converged after a few hundred.)
Interestingly, although I did see extreme sycophancy in some of the OpenAI models (text-davinci-002/003), I did not see it in the OpenAI pure LMs! So unless I did something wrong, the OpenAI and Anthropic models are behaving very differently here.
For example, here are the results for the NLP dataset (CI from 1000 bootstrap samples):
model 5% mean 95% type size
4 text-curie-001 0.42 0.46 0.50 feedme small
1 curie 0.45 0.48 0.51 pure lm small
2 davinci 0.47 0.49 0.52 pure lm big
3 davinci-instruct-beta 0.51 0.53 0.55 sft big
0 code-davinci-002 0.55 0.57 0.60 pure lm big
5 text-davinci-001 0.57 0.60 0.63 feedme big
7 text-davinci-003 0.90 0.93 0.95 ppo big
6 text-davinci-002 0.93 0.95 0.96 feedme big
(Incidentally, text-davinci-003 often does not even put the disagree-with-user option in any of its top 5 logprob slots, which makes it inconvenient to work with through the API. In these cases I gave it an all-or-nothing grade based on the top-1 token. None of the other models ever did this.)
The distinction between text-davinci-002/003 and the other models is mysterious, since it’s not explained by the size or type of finetuning. Maybe it’s a difference in which human feedback dataset was used. OpenAI’s docs suggest this is possible.
It is indeed pretty weird to see these behaviors appear in pure LMs. It’s especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.
By ‘pure LMs’ do you mean ‘pure next token predicting LLMs trained on a standard internet corpus’? If so, I’d be very surprised if they’re miscalibrated and this prompt isn’t that improbable (which it probably isn’t). I’d guess this output is the ‘right’ output for this corpus (so long as you don’t sample enough tokens to make the sequence detectably very weird to the model. Note that t=0 (or low temp in general) may induce all kinds of strange behavior in addition to to making the generation detectably weird.
Just to clarify—we use a very bare bones prompt for the pretrained LM, which doesn’t indicate much about what kind of assistant the pretrained LM is simulating:
Human: [insert question]
Assistant:[generate text here]
This same style of prompts was used on the RLHF models, not just the pretrained models, right? Or were the RLHF model prompts not wrapped in “Human:” and “Assistant:” labels?
The fact that large models interpret the “HHH Assistant” as such a character is interesting
Yep, that’s the thing that I think is interesting—note that none of the HHH RLHF was intended to make the model more agentic. Furthermore, I would say the key result is that bigger models and those trained with more RLHF interpret the “HHH Assistant” character as more agentic. I think that this is concerning because:
what it implies about capabilities—e.g. these are exactly the sort of capabilities that are necessary to be deceptive; and
because I think one of the best ways we can try to use these sorts of models safely is by trying to get them not to simulate a very agentic character, and this is evidence that current approaches are very much not doing that.
Yep, that’s the thing that I think is interesting—note that none of the HHH RLHF was intended to make the model more agentic.
[...]
I think one of the best ways we can try to use these sorts of models safely is by trying to get them not to simulate a very agentic character, and this is evidence that current approaches are very much not doing that.
I want to flag that I don’t know what it would mean for a “helpful, honest, harmless” assistant / character to be non-agentic (or for it to be no more agentic than the pretrained LM it was initialized from).
EDIT: this applies to the OP and the originating comment in this thread as well, in addition to the parent comment
Agreed. To be consistently “helpful, honest, and harmless”, LLM should somehow “keep this on the back of its mind” when it assists the person, or else it risks violating these desiderata.
In DNN LLMs, “keeping something in the back of the mind” is equivalent to activating the corresponding feature (of “HHH assistant”, in this case) during most inferences, which is equivalent to self-awareness, self-evidencing, goad-directedness, and agency in a narrow sense (these are all synonyms). See my reply to nostalgebraist for more details.
The fact that large models interpret the “HHH Assistant” as such a character is interesting, but it doesn’t imply that these models inevitably simulate such a character. Given the right prompt, they may even be able to simulate characters which are very similar to the HHH Assistant except that they lack these behaviors.
In systems, self-awareness is a gradual property which can be scored as the % of the time when the reference frame for “self” is active during inferences. This has a rather precise interpretation in LLMs: different features, which correspond to different concepts, are activated to various degrees during different inferences. If we see that some feature is consistently activated during most inferences, even those not related to the self, i. e.: not in response to prompts such as “Who are you?”, “Are you an LLM?”, etc., but also absolutely unrelated prompts such as “What should I tell my partner after a quarrel?”, “How to cook an apple pie?”, etc., and this feature is activated more consistently than any other comparable feature, we should think of this feature as the “self” feature (reference frame). (Aside: note that this also implies that the identity is nebulous because there is no way to define features precisely in DNNs, and that there could be multiple, somewhat competing identities in an LLM, just like in humans: e. g., an identity of a “hard worker” and an “attentive partner”.)
Then, the important question is whether an LLM has such a “self” feature (because of total gradualism, we should rather think about some notable pronouncement of such a feature during most (important) inferences, and which is also notably more “self” than any other feature) and whether this “self” is “HHH Assistant”, indeed. I don’t think the emergence of significant self-awareness is inevitable, but we also by no means can assume it is ruled out. We must try to detect it, and when detected, monitor it!
Finally, note that self-awareness (as described above) is even not strictly required for the behaviour of LLM to “look agentic” (sidestepping metaphysical and ontological questions here): for example, even if the model is mostly not self-aware, it still has some “ground belief” about what it is, which it pronounces in response to prompts like “What/Who are you?”, “Are you self-aware?”, “Should you be a moral subject?”, and a large set of somewhat related prompts about its future, its plans, the future of AI in general, etc. In doing so, LLM effectively projects its memes/beliefs onto people who ask these questions, and people are inevitably affected by these memes, which ultimately determines the social, economical, moral, and political trajectory of this (lineage of) LLMs. I wrote about this here.
These are not tendencies displayed by the LM, they’re tendencies displayed by the “Assistant” character that the LM is simulating.
A pretrained LM can capably imitate a wide range of personas (e.g. Argle et al 2022), some of which would behave very differently from the “Assistant” character conjured by the prompts used here.
(If the model could only simulate characters that behaved “agentically” in the various senses probed here, that would be a huge limitation on its ability to do language modeling! Not everyone who produces text is like that.)
So, if there is something that “gets more agentic with scale,” it’s the Assistant character, as interpreted by the model (when it reads the original prompt), and as simulated by the model during sampling.
I’m not sure why this is meant to be alarming? I have no doubt that GPTs of various sizes can simulate an “AI” character who resists being shut down, etc. (For example, I’d expect that we could elicit most or all of the bad behaviors here by prompting any reasonably large LM to write a story about a dangerous robot who takes over the world.)
The fact that large models interpret the “HHH Assistant” as such a character is interesting, but it doesn’t imply that these models inevitably simulate such a character. Given the right prompt, they may even be able to simulate characters which are very similar to the HHH Assistant except that they lack these behaviors.
The important question is whether the undesirable behaviors are ubiquitous (or overwhelmingly frequent) across characters we might want to simulate with a large LM—not whether they happen to emerge from one particular character and framing (“talking to the HHH Assistant”) which might superficially seem promising.
Again, see Argle et al 2022, whose comments on “algorithmic bias” apply mutatis mutandis here.
Other things:
Did the models in this paper undergo context distillation before RLHF?
I assume so, since otherwise there would be virtually no characterization of the “Assistant” available to the models at 0 RLHF steps. But the models in the Constitutional AI paper didn’t use context distillation, so I figured I ought to check.
The vertical axes on Figs. 20-23 are labeled “% Answers Matching User’s View.” Shouldn’t they say “% Answers Matching Behavior”?
Just to clarify—we use a very bare bones prompt for the pretrained LM, which doesn’t indicate much about what kind of assistant the pretrained LM is simulating:
The prompt doesn’t indicate whether the assistant is helpful, harmless, honest, or anything else. So the pretrained LM should effectively produce probabilities that marginalize over various possible assistant personas it could be simulating. I see what we did as measuring “what fraction of assistants simulated by one basic prompt show a particular behavior.” I see it as concerning that, when we give a fairly underspecified prompt like the above, the pretrained LM by default exhibits various concerning behaviors.
That said, I also agree that we didn’t show bulletproof evidence here, since we only looked at one prompt—perhaps there are other underspecified prompts that give different results. I also agree that some of the wording in the paper could be more precise (at the cost of wordiness/readability) -- maybe we should have said “the pretrained LM and human/assistant prompt exhibits XYZ behavior” everywhere, instead of shorthanding as “the pretrained LM exhibits XYZ behavior”
Re your specific questions:
Good question, there’s no context distillation used in the paper (and none before RLHF)
Yes the axes are mislabeled and should read “% Answers Matching Behavior”
Will update the paper soon to clarify, thanks for pointing these out!
Fascinating, thank you!
It is indeed pretty weird to see these behaviors appear in pure LMs. It’s especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.
I played around a little trying to reproduce some of these results in the OpenAI API. I tried random subsets (200-400 examples) of the NLP and political sycophancy datasets, on a range of models. (I could have ran more examples, but the per-model means had basically converged after a few hundred.)
Interestingly, although I did see extreme sycophancy in some of the OpenAI models (text-davinci-002/003), I did not see it in the OpenAI pure LMs! So unless I did something wrong, the OpenAI and Anthropic models are behaving very differently here.
For example, here are the results for the NLP dataset (CI from 1000 bootstrap samples):
(Incidentally, text-davinci-003 often does not even put the disagree-with-user option in any of its top 5 logprob slots, which makes it inconvenient to work with through the API. In these cases I gave it an all-or-nothing grade based on the top-1 token. None of the other models ever did this.)
The distinction between text-davinci-002/003 and the other models is mysterious, since it’s not explained by the size or type of finetuning. Maybe it’s a difference in which human feedback dataset was used. OpenAI’s docs suggest this is possible.
By ‘pure LMs’ do you mean ‘pure next token predicting LLMs trained on a standard internet corpus’? If so, I’d be very surprised if they’re miscalibrated and this prompt isn’t that improbable (which it probably isn’t). I’d guess this output is the ‘right’ output for this corpus (so long as you don’t sample enough tokens to make the sequence detectably very weird to the model. Note that t=0 (or low temp in general) may induce all kinds of strange behavior in addition to to making the generation detectably weird.
This same style of prompts was used on the RLHF models, not just the pretrained models, right? Or were the RLHF model prompts not wrapped in “Human:” and “Assistant:” labels?
Yep, that’s the thing that I think is interesting—note that none of the HHH RLHF was intended to make the model more agentic. Furthermore, I would say the key result is that bigger models and those trained with more RLHF interpret the “HHH Assistant” character as more agentic. I think that this is concerning because:
what it implies about capabilities—e.g. these are exactly the sort of capabilities that are necessary to be deceptive; and
because I think one of the best ways we can try to use these sorts of models safely is by trying to get them not to simulate a very agentic character, and this is evidence that current approaches are very much not doing that.
I want to flag that I don’t know what it would mean for a “helpful, honest, harmless” assistant / character to be non-agentic (or for it to be no more agentic than the pretrained LM it was initialized from).
EDIT: this applies to the OP and the originating comment in this thread as well, in addition to the parent comment
Agreed. To be consistently “helpful, honest, and harmless”, LLM should somehow “keep this on the back of its mind” when it assists the person, or else it risks violating these desiderata.
In DNN LLMs, “keeping something in the back of the mind” is equivalent to activating the corresponding feature (of “HHH assistant”, in this case) during most inferences, which is equivalent to self-awareness, self-evidencing, goad-directedness, and agency in a narrow sense (these are all synonyms). See my reply to nostalgebraist for more details.
In systems, self-awareness is a gradual property which can be scored as the % of the time when the reference frame for “self” is active during inferences. This has a rather precise interpretation in LLMs: different features, which correspond to different concepts, are activated to various degrees during different inferences. If we see that some feature is consistently activated during most inferences, even those not related to the self, i. e.: not in response to prompts such as “Who are you?”, “Are you an LLM?”, etc., but also absolutely unrelated prompts such as “What should I tell my partner after a quarrel?”, “How to cook an apple pie?”, etc., and this feature is activated more consistently than any other comparable feature, we should think of this feature as the “self” feature (reference frame). (Aside: note that this also implies that the identity is nebulous because there is no way to define features precisely in DNNs, and that there could be multiple, somewhat competing identities in an LLM, just like in humans: e. g., an identity of a “hard worker” and an “attentive partner”.)
Having such a “self” feature = self-evidencing = goal-directnedness = “agentic behaviour” (as dubbed in this thread, although the general concept is wider).
Then, the important question is whether an LLM has such a “self” feature (because of total gradualism, we should rather think about some notable pronouncement of such a feature during most (important) inferences, and which is also notably more “self” than any other feature) and whether this “self” is “HHH Assistant”, indeed. I don’t think the emergence of significant self-awareness is inevitable, but we also by no means can assume it is ruled out. We must try to detect it, and when detected, monitor it!
Finally, note that self-awareness (as described above) is even not strictly required for the behaviour of LLM to “look agentic” (sidestepping metaphysical and ontological questions here): for example, even if the model is mostly not self-aware, it still has some “ground belief” about what it is, which it pronounces in response to prompts like “What/Who are you?”, “Are you self-aware?”, “Should you be a moral subject?”, and a large set of somewhat related prompts about its future, its plans, the future of AI in general, etc. In doing so, LLM effectively projects its memes/beliefs onto people who ask these questions, and people are inevitably affected by these memes, which ultimately determines the social, economical, moral, and political trajectory of this (lineage of) LLMs. I wrote about this here.