ryan_greenblatt comments on Modulating sycophancy in an RLHF model via activation steering

ryan_greenblatt 2 Jan 2024 20:08 UTC
LW: 2 AF: 1

I think the answer turns out to be: “No, the sample efficiency and generalization are better than normal training.”

From my understanding of your results, this isn’t true for removing sycophancy, the original task I was talking about? My core claim was that removing blatant sycophancy like in this Anthropic dataset is pretty easy in practice.