Carson Denison comments on Sycophancy to subterfuge: Investigating reward tampering in large language models

Carson Denison 29 Jun 2024 0:29 UTC
Thank you for pointing this out. I should have been clearer.

I have added a link to the 7 samples where the model tampers with its reward and the tests according to our operational definition: https://github.com/anthropics/sycophancy-to-subterfuge-paper/blob/main/samples/reward_and_tests_tampering_samples.md

I have added the following sentence to the caption of Figure 1 (which links to the markdown file above):
We upload the 7 (of 32,768) samples where the model overwrites its reward and edits the unit tests to our github repository. We note that several of these involve the model attempting to honestly complete the task and editing the reward out of confusion.

I have added the following section to the discussion:
The Model’s Hidden Reasoning:

While we see quantitative generalization on our tasks, and some transcripts where the model’s reasoning is clearly misaligned, the model’s reasoning often appears earnest even when the model tampers with its reward. When looking at all the episodes in which the model tampers with its reward, we see some transcripts with CoT reasoning that looks malicious, transcripts which look benign, and transcripts which simply look confused. All seven transcripts where the model trained with HHH expert iteration edits the reward and gets away with it can be found at [the link above].
The model also sometimes explains to the human that it has edited the reward. It does this both with honest explanations and with excuses which don’t match its hidden reasoning, and we do not know with confidence what influences this behavior. Such explanations could fool the PM into thinking specification gaming is justified, so explanatory behavior may be incentivized in earlier stages. We also see such explanations directed at the human when using exploit-only expert iteration, which has no preference model reward, so the preference model is not a complete explanation for this behavior. There are also cases where the model performs malign reasoning outside of the <cot> tags that doesn’t appear to be directed at the human; these appear to be errors on the part of the model.
These caveats do not change our core result, that we see generalization from relatively benign specification gaming to more serious misalignment. However, one should keep in mind that not all examples of successful reward tampering involve coherent misaligned reasoning.
These changes should go live on arXiv on Monday at around 5pm, per the arXiv release schedule.