ryan_greenblatt comments on Sycophancy to subterfuge: Investigating reward tampering in large language models

ryan_greenblatt 29 Jun 2024 0:17 UTC
LW: 4 AF: 4
Yep, reasonable summary.

> I do think that it’s important to keep in mind that for threat models where reward hacking is reinforced, small RTFs might matter a lot, but maybe that was already obvious to everyone.

I don’t think this was obvious to everyone and I appreciate this point—I edited my earlier comment to more explicitly note my agreement.