evhub comments on Sycophancy to subterfuge: Investigating reward tampering in large language models

evhub 28 Jun 2024 21:04 UTC
LW: 2 AF: 2
We don’t have 50 examples where it succeeds at reward-tampering, though, only that many examples where it attempts it. I do think we could probably make some statements like that, though it’d be harder to justify any hypotheses as to why. I guess this goes back to what I was saying about us probably having cut too much here: originally we were speculating on a bunch of hypotheses as to why it sometimes did one vs. the other, and decided we couldn’t quite justify those. But certainly we can just report numbers (even if we’re not that confident in them).