ryan_greenblatt comments on Modulating sycophancy in an RLHF model via activation steering

ryan_greenblatt 10 Aug 2023 19:08 UTC
LW: 1 AF: 1
0
AF
More generally, I think arguments that human feedback is failing should ideally be of the form:

“Human labelers (with AI assistance) fail to notice this sort of bad behavior. Also, either this or nearby stuff can’t just be resolved with trivial and obvious countermeasures like telling human labelers to be on the look out for this bad behavior.”

See Meta-level oversight evaluation for how I think you should evaluate oversight in general.