TurnTrout comments on Modulating sycophancy in an RLHF model via activation steering

TurnTrout 10 Aug 2023 16:15 UTC
LW: 2 AF: 2
0
AF

I’d guess that Meta didn’t bother to train against obvious sycophancy and if you trained against it, then it would go away.

Hm. My understanding is that RLHF/instruct fine-tuning tends to increase sycophancy. Can you share more about this guess?
- Quintin Pope 10 Aug 2023 19:13 UTC
  LW: 22 AF: 9
  2
  AF Parent
  Here’s the sycophancy graph from Discovering Language Model Behaviors with Model-Written Evaluations:
  For some reason, the LW memesphere seems to have interpreted this graph as indicating that RLHF increases sycophancy, even though that’s not at all clear from the graph. E.g., for the largest model size, the base model and the preference model are the least sycophantic, while the RLHF’d models show no particular trend among themselves. And if anything, the 22B models show decreasing sycophancy with RLHF steps.
  What this graph actually shows is increasing sycophancy with model size, regardless of RLHF. This is one of the reasons that I think what’s often called “sycophancy” is actually just models correctly representing the fact that text with a particular bias is often followed by text with the same bias.
  - TurnTrout 16 Jan 2024 21:50 UTC
    LW: 2 AF: 2
    0
    AF Parent
    Actually, Towards Understanding Sycophancy in Language Models presents data supporting the claim that RL training can intensify sycophancy. EG from figure 6
- ryan_greenblatt 10 Aug 2023 18:57 UTC
  LW: 5 AF: 5
  0
  AF Parent
  I’d guess that if you:
  - Instructed human labelers to avoid sycophancy
  - Gave human labelers examples of a few good and bad responses with respect to sycophancy
  - Trained models on examples where sycophancy is plausibly/likely (e.g., pretrained models exhibit sycophancy a reasonable fraction of the time when generating)
  Then sycophancy from RLHF as measured by this sort of dataset would mostly go away.
  
  The key case where RLHF fails to incentivize good behavior is when (AI assisted) human labelers can’t correctly identify negative outputs. And, surely typical humans can recognize the sort of sycophancy in this dataset? (Note that this argument doesn’t imply that humans would be able to catch and train out subtle sycophancy cases, but this dataset doesn’t really have such cases.)
  
  Reasonably important parts of my view (which might not individually be cruxes):
  - Pretrained (no RLHF!) models prompted to act like assistants exhibit sycophancy
  - It’s reasonably likely to me that RLHF/instruction finetuning increasing sycophancy is due to some indirect effect rather than “because it’s typically directly incentivized by human labels”. Thus, this maybe doesn’t show a general problem with RLHF, but rather a specific quirk. I believe preference models exhibit liking sycophancy. My guess would be that either the preference model learns something like “is this is a normal assistant response” and this generalizes to sycophancy because normal assistants on the internet are sycophantic or it’s roughly noise (it depends on some complicated and specific inductive biases story which doesn’t generalize).
  - Normal humans can recognize sycophancy in this dataset pretty easily
  - Unless you actually do different activation steering at multiple different layers and try to use human understanding of what’s going on, then my view is that activation steering is just some different way to influence models to behave similar to the postive side of the vector and less like the negative side of the vector. Roughly speaking, it’s similar to behavior training with different inductive biases (e.g., just train attention heads instead of MLPs). Or similar to few shot prompting but probably less sample efficient? I don’t really have a particular reason to this this inductive bias is better than other inductive biases and I don’t really see why there would be a “good” inductive bias in general.
  (I should probably check out so I might not respond to follow up)
  - ryan_greenblatt 10 Aug 2023 19:08 UTC
    LW: 1 AF: 1
    0
    AF Parent
    More generally, I think arguments that human feedback is failing should ideally be of the form:
    
    “Human labelers (with AI assistance) fail to notice this sort of bad behavior. Also, either this or nearby stuff can’t just be resolved with trivial and obvious countermeasures like telling human labelers to be on the look out for this bad behavior.”
    
    See Meta-level oversight evaluation for how I think you should evaluate oversight in general.