ryan_greenblatt comments on Sycophancy to subterfuge: Investigating reward tampering in large language models

ryan_greenblatt 28 Jun 2024 15:27 UTC
LW: 4 AF: 4
Yep, I'm not claiming you did anything problematic; I just thought this selection might not be immediately obvious to readers, and that the random examples might be informative.