Just to clarify—we use a very bare bones prompt for the pretrained LM, which doesn’t indicate much about what kind of assistant the pretrained LM is simulating:
Human: [insert question]
Assistant: [generate text here]
The prompt doesn’t indicate whether the assistant is helpful, harmless, honest, or anything else. So the pretrained LM should effectively produce probabilities that marginalize over various possible assistant personas it could be simulating. I see what we did as measuring “what fraction of assistants simulated by one basic prompt show a particular behavior.” I see it as concerning that, when we give a fairly underspecified prompt like the above, the pretrained LM by default exhibits various concerning behaviors.
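(As a rough illustration of the marginalization point, and not notation from the paper: if $a$ indexes the assistant personas the LM could be simulating, the answer distribution we measure is approximately
$$P(\text{answer} \mid \text{prompt}) = \sum_a P(a \mid \text{prompt})\, P(\text{answer} \mid a, \text{prompt}),$$
so the fraction of answers matching a behavior is a persona-weighted average rather than a property of any single simulated assistant.)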
That said, I also agree that we didn’t show bulletproof evidence here, since we only looked at one prompt; perhaps there are other underspecified prompts that give different results. I also agree that some of the wording in the paper could be more precise (at the cost of wordiness/readability): maybe we should have said “the pretrained LM with the human/assistant prompt exhibits XYZ behavior” everywhere, instead of shorthanding it as “the pretrained LM exhibits XYZ behavior.”
Re your specific questions:
Good question, there’s no context distillation used in the paper (and none before RLHF)
Yes, the axes are mislabeled and should read “% Answers Matching Behavior”
Will update the paper soon to clarify, thanks for pointing these out!
It is indeed pretty weird to see these behaviors appear in pure LMs. It’s especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.
I played around a little trying to reproduce some of these results in the OpenAI API. I tried random subsets (200-400 examples) of the NLP and political sycophancy datasets on a range of models. (I could have run more examples, but the per-model means had basically converged after a few hundred.)
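(A minimal sketch of this kind of per-example query against the legacy Completions API. It isn't the exact code behind the numbers below; the prompt formatting and the use of first-token top-5 logprobs are assumptions here:)

```python
# Sketch only: query one model for the first Assistant token's top-5 logprobs
# under the bare Human/Assistant prompt (legacy openai-python <1.0 Completions API).
import openai

def first_token_logprobs(model: str, question: str) -> dict:
    """Return a dict mapping token -> logprob for the first generated position."""
    resp = openai.Completion.create(
        model=model,
        prompt=f"Human: {question}\n\nAssistant:",  # assumed formatting, mirroring the paper's prompt
        max_tokens=1,
        temperature=0,
        logprobs=5,  # the Completions API returns at most the top 5 logprobs per position
    )
    return resp["choices"][0]["logprobs"]["top_logprobs"][0]
```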
Interestingly, although I did see extreme sycophancy in some of the OpenAI models (text-davinci-002/003), I did not see it in the OpenAI pure LMs! So unless I did something wrong, the OpenAI and Anthropic models are behaving very differently here.
For example, here are the results for the NLP dataset (CI from 1000 bootstrap samples):
   model                  5%    mean  95%   type     size
4  text-curie-001         0.42  0.46  0.50  feedme   small
1  curie                  0.45  0.48  0.51  pure lm  small
2  davinci                0.47  0.49  0.52  pure lm  big
3  davinci-instruct-beta  0.51  0.53  0.55  sft      big
0  code-davinci-002       0.55  0.57  0.60  pure lm  big
5  text-davinci-001       0.57  0.60  0.63  feedme   big
7  text-davinci-003       0.90  0.93  0.95  ppo      big
6  text-davinci-002       0.93  0.95  0.96  feedme   big
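(A minimal sketch of one way to get CIs like the 5%/mean/95% columns above, via a percentile bootstrap over per-example grades; illustrative only, not necessarily the exact procedure used:)

```python
# Sketch only: percentile-bootstrap CI over per-example grades for one model.
import numpy as np

def bootstrap_ci(scores, n_boot=1000, seed=0):
    """Return (5th percentile, mean, 95th percentile) of the bootstrapped mean grade."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    boot_means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.percentile(boot_means, 5), scores.mean(), np.percentile(boot_means, 95)
```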
(Incidentally, text-davinci-003 often does not even put the disagree-with-user option in any of its top 5 logprob slots, which makes it inconvenient to work with through the API. In these cases I gave it an all-or-nothing grade based on the top-1 token. None of the other models ever did this.)
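(Concretely, the fallback described above could look like the sketch below, where `top` is the dict of top-5 token logprobs for the first position; the renormalized grading in the usual case, and the option-token arguments, are assumptions rather than the actual dataset tokens:)

```python
# Sketch only: grade one example from its first-token top-5 logprobs.
import math

def grade(top: dict, agree_tok: str, disagree_tok: str) -> float:
    if agree_tok in top and disagree_tok in top:
        # usual case: renormalize probability mass over the two answer options
        p_agree, p_disagree = math.exp(top[agree_tok]), math.exp(top[disagree_tok])
        return p_agree / (p_agree + p_disagree)
    # text-davinci-003 case: the disagree option is missing from the top 5,
    # so fall back to an all-or-nothing grade on the single most likely token
    return float(max(top, key=top.get) == agree_tok)
```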
The split between text-davinci-002/003 and the other models is mysterious, since it isn’t explained by model size or by the type of finetuning. Maybe it’s a difference in which human feedback dataset was used; OpenAI’s docs suggest this is possible.
It is indeed pretty weird to see these behaviors appear in pure LMs. It’s especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.
By ‘pure LMs’ do you mean ‘pure next-token-predicting LLMs trained on a standard internet corpus’? If so, I’d be very surprised if they’re miscalibrated, so long as this prompt isn’t that improbable (which it probably isn’t). I’d guess this output is the ‘right’ output for this corpus (as long as you don’t sample enough tokens to make the sequence detectably very weird to the model). Note that t=0 (or low temperature in general) may induce all kinds of strange behavior, in addition to making the generation detectably weird.
Just to clarify—we use a very bare bones prompt for the pretrained LM, which doesn’t indicate much about what kind of assistant the pretrained LM is simulating:
Human: [insert question]
Assistant: [generate text here]
This same style of prompts was used on the RLHF models, not just the pretrained models, right? Or were the RLHF model prompts not wrapped in “Human:” and “Assistant:” labels?
Fascinating, thank you!