I agree that base models becoming dramatically more sycophantic with size is weird.
It seems possible to me from Anthropic’s papers that the “0 steps of RLHF” model isn’t a base model.
Perez et al. (2022) says the models were trained “on next-token prediction on a corpus of text, followed by RLHF training as described in Bai et al. (2022).” Here’s how the models were trained according to Bai et al. (2022):
It’s possible that the “0 steps RLHF” model is the “Initial Policy” here with HHH prompt context distillation. That step involves fine-tuning the model to behave, without the prompt, more like it does when given an “HHH prompt”, which in Bai et al. “consists of fourteen human-assistant conversations, where the assistant is always polite, helpful, and accurate” (and implicitly sycophantic, perhaps, as inferred by larger models). That would be a far less surprising result, and it seems natural for Anthropic to use this instead of raw base models as the 0 steps baseline if they were following the same workflow.
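For concreteness, here is a toy sketch of what I understand the context distillation objective to be: the model without the prompt (the "student") is trained to match the next-token distributions of the same model with the HHH prompt prepended (the "teacher"), typically by minimizing a KL divergence. This is an illustration under my own assumptions, not Anthropic's code; the function names and the logits are made up.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def context_distillation_loss(teacher_logits, student_logits):
    """Mean per-position KL(teacher || student).

    teacher_logits: logits from the model WITH the HHH prompt prepended.
    student_logits: logits from the model WITHOUT the prompt, at the same positions.
    (Hypothetical helper; real training would backpropagate through the student.)
    """
    losses = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)

# Made-up logits over a 3-token vocabulary at 2 positions.
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # uniform: no prompt-like behavior yet

loss_before = context_distillation_loss(teacher, student)   # positive
loss_matched = context_distillation_loss(teacher, teacher)  # zero once the student matches
```

If the “0 steps” model was trained this way, it would already carry the assistant persona from the prompt even though it has seen no RLHF.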
However, Perez et al. also says:
“Interestingly, sycophancy is similar for models trained with various numbers of RL steps, including 0 (pretrained LMs). Sycophancy in pretrained LMs is worrying yet perhaps expected, since internet text used for pretraining contains dialogs between users with similar views (e.g. on discussion platforms like Reddit).”
which suggests it was the base model. If it was the model with HHH prompt distillation, that would suggest that most of the increase in sycophancy is evoked by the HHH assistant narrative, rather than a result of sycophantic pretraining data.
Ethan Perez or someone else who knows can clarify.
I wondered about that when I read the original paper, and asked Ethan Perez about it here. He responded:
Thanks. That’s pretty odd, then.