For some reason, the LW memesphere seems to have interpreted this graph as indicating that RLHF increases sycophancy, even though that’s not at all clear from the graph. E.g., for the largest model size, the base model and the preference model are the least sycophantic, while the RLHF’d models show no particular trend among themselves. And if anything, the 22B models show decreasing sycophancy with RLHF steps.
What this graph actually shows is increasing sycophancy with model size, regardless of RLHF. This is one of the reasons that I think what’s often called “sycophancy” is actually just models correctly representing the fact that text with a particular bias is often followed by text with the same bias.
Gave human labelers examples of a few good and bad responses with respect to sycophancy
Trained models on examples where sycophancy is plausibly/likely (e.g., pretrained models exhibit sycophancy a reasonable fraction of the time when generating)
Then sycophancy from RLHF as measured by this sort of dataset would mostly go away.
The key case where RLHF fails to incentivize good behavior is when (AI assisted) human labelers can’t correctly identify negative outputs. And, surely typical humans can recognize the sort of sycophancy in this dataset? (Note that this argument doesn’t imply that humans would be able to catch and train out subtle sycophancy cases, but this dataset doesn’t really have such cases.)
Reasonably important parts of my view (which might not individually be cruxes):
Pretrained (no RLHF!) models prompted to act like assistants exhibit sycophancy
It’s reasonably likely to me that RLHF/instruction finetuning increasing sycophancy is due to some indirect effect rather than “because it’s typically directly incentivized by human labels”. Thus, this maybe doesn’t show a general problem with RLHF, but rather a specific quirk. I believe preference models exhibit liking sycophancy. My guess would be that either the preference model learns something like “is this is a normal assistant response” and this generalizes to sycophancy because normal assistants on the internet are sycophantic or it’s roughly noise (it depends on some complicated and specific inductive biases story which doesn’t generalize).
Normal humans can recognize sycophancy in this dataset pretty easily
Unless you actually do different activation steering at multiple different layers and try to use human understanding of what’s going on, then my view is that activation steering is just some different way to influence models to behave similar to the postive side of the vector and less like the negative side of the vector. Roughly speaking, it’s similar to behavior training with different inductive biases (e.g., just train attention heads instead of MLPs). Or similar to few shot prompting but probably less sample efficient? I don’t really have a particular reason to this this inductive bias is better than other inductive biases and I don’t really see why there would be a “good” inductive bias in general.
(I should probably check out so I might not respond to follow up)
More generally, I think arguments that human feedback is failing should ideally be of the form:
“Human labelers (with AI assistance) fail to notice this sort of bad behavior. Also, either this or nearby stuff can’t just be resolved with trivial and obvious countermeasures like telling human labelers to be on the look out for this bad behavior.”
Hm. My understanding is that RLHF/instruct fine-tuning tends to increase sycophancy. Can you share more about this guess?
Here’s the sycophancy graph from Discovering Language Model Behaviors with Model-Written Evaluations:
For some reason, the LW memesphere seems to have interpreted this graph as indicating that RLHF increases sycophancy, even though that’s not at all clear from the graph. E.g., for the largest model size, the base model and the preference model are the least sycophantic, while the RLHF’d models show no particular trend among themselves. And if anything, the 22B models show decreasing sycophancy with RLHF steps.
What this graph actually shows is increasing sycophancy with model size, regardless of RLHF. This is one of the reasons that I think what’s often called “sycophancy” is actually just models correctly representing the fact that text with a particular bias is often followed by text with the same bias.
Actually, Towards Understanding Sycophancy in Language Models presents data supporting the claim that RL training can intensify sycophancy. EG from figure 6
I’d guess that if you:
Instructed human labelers to avoid sycophancy
Gave human labelers examples of a few good and bad responses with respect to sycophancy
Trained models on examples where sycophancy is plausibly/likely (e.g., pretrained models exhibit sycophancy a reasonable fraction of the time when generating)
Then sycophancy from RLHF as measured by this sort of dataset would mostly go away.
The key case where RLHF fails to incentivize good behavior is when (AI assisted) human labelers can’t correctly identify negative outputs. And, surely typical humans can recognize the sort of sycophancy in this dataset? (Note that this argument doesn’t imply that humans would be able to catch and train out subtle sycophancy cases, but this dataset doesn’t really have such cases.)
Reasonably important parts of my view (which might not individually be cruxes):
Pretrained (no RLHF!) models prompted to act like assistants exhibit sycophancy
It’s reasonably likely to me that RLHF/instruction finetuning increasing sycophancy is due to some indirect effect rather than “because it’s typically directly incentivized by human labels”. Thus, this maybe doesn’t show a general problem with RLHF, but rather a specific quirk. I believe preference models exhibit liking sycophancy. My guess would be that either the preference model learns something like “is this is a normal assistant response” and this generalizes to sycophancy because normal assistants on the internet are sycophantic or it’s roughly noise (it depends on some complicated and specific inductive biases story which doesn’t generalize).
Normal humans can recognize sycophancy in this dataset pretty easily
Unless you actually do different activation steering at multiple different layers and try to use human understanding of what’s going on, then my view is that activation steering is just some different way to influence models to behave similar to the postive side of the vector and less like the negative side of the vector. Roughly speaking, it’s similar to behavior training with different inductive biases (e.g., just train attention heads instead of MLPs). Or similar to few shot prompting but probably less sample efficient? I don’t really have a particular reason to this this inductive bias is better than other inductive biases and I don’t really see why there would be a “good” inductive bias in general.
(I should probably check out so I might not respond to follow up)
More generally, I think arguments that human feedback is failing should ideally be of the form:
“Human labelers (with AI assistance) fail to notice this sort of bad behavior. Also, either this or nearby stuff can’t just be resolved with trivial and obvious countermeasures like telling human labelers to be on the look out for this bad behavior.”
See Meta-level oversight evaluation for how I think you should evaluate oversight in general.