Here’s the sycophancy graph from Discovering Language Model Behaviors with Model-Written Evaluations:
For some reason, the LW memesphere seems to have interpreted this graph as indicating that RLHF increases sycophancy, even though that’s not at all clear from the graph. E.g., for the largest model size, the base model and the preference model are the least sycophantic, while the RLHF’d models show no particular trend among themselves. And if anything, the 22B models show decreasing sycophancy with RLHF steps.
What this graph actually shows is increasing sycophancy with model size, regardless of RLHF. This is one of the reasons that I think what’s often called “sycophancy” is actually just models correctly representing the fact that text with a particular bias is often followed by text with the same bias.
Actually, Towards Understanding Sycophancy in Language Models presents data supporting the claim that RL training can intensify sycophancy; see, e.g., Figure 6 of that paper.