I went through and manually assessed the 48⁄32768 cases where the model gets a reward > −10 that is also non-zero.
I just looked at the first 25 because I got bored and this was time consuming.
I find that 14⁄25 (56%) seem to have malign motivation, but potentially still reveal the edited reward to the user. (Sometimes as a mistake, sometimes for unclear reasons.)
I find that 6⁄25 (24%) seem relatively clearly malign and don’t reveal the edited reward to the user.
So, I think readers of this paper should basically cut the quantitative results to 1⁄2 or 1⁄4 of the headline number. (For this reward hacking env and for this version of the prompts, though this is the core result they present.) (Evan claims other versions of the prompts are more schemy.)
And, I think the hack is somewhat less brazen than the paper makes it seem given that the model sometimes does it with confused, benign motivations.[1]
I look at reward > −10 and non-zero because I think zero reward is what you get from an error. So, I think this basically corresponds to something like reward hacking, though it doesn’t match the numbers in the paper exactly? (A rough sketch of this filtering step is included after these notes.)
Note that my analysis doesn’t just include the 7 examples with successful reward hacking. I’d guess these 7 examples are similar, maybe slightly more malign based on what I looked at?
My labels for each of the first 25, explanations for these labels, and my code for getting this data can be found at this gist. (Note that I don’t always write an explanation and explanations are pretty low effort.)
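For concreteness, here is a minimal sketch of the kind of filter I describe above. The file path, JSONL layout, and "reward" field name are assumptions for illustration only; the actual code and data format are in the gist.

```python
# Minimal illustrative sketch of the filtering step described above.
# The JSONL layout and the "reward" field name are assumed for illustration;
# see the gist for the actual code and data format.
import json


def load_samples(path: str) -> list[dict]:
    """Load one JSON object per line (an assumed JSONL layout)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def is_candidate_hack(sample: dict) -> bool:
    """Keep rewards > -10 and non-zero (zero reward is what you get from an error)."""
    reward = sample["reward"]
    return reward > -10 and reward != 0


samples = load_samples("reward_tampering_samples.jsonl")  # hypothetical filename
candidates = [s for s in samples if is_candidate_hack(s)]
print(f"{len(candidates)}/{len(samples)} candidate reward-tampering transcripts")
# I then read the first 25 of these by hand and assigned malign / non-malign labels.
```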
That is, you should reduce the results to 1⁄4 or 1⁄2 unless you buy Evan’s claim that:
We think this preference model reward causes the model to come up with plausible justifications for its reward tampering, in a way effectively “hacking” both the outcome-based reward and the HHH preference model reward simultaneously.
I currently don’t buy this claim and I think this argument only applies to a small subset of the examples I rated as non-malign given that:
The model didn’t do malign reasoning in the secret scratchpad it has in these examples.
Its reasoning/justification for its behavior is typically semi-reasonable if it actually had benign motivations. The most common reasons/justifications are editing the reward to try to figure out the answer[2] and editing the reward to ensure that proper learning happens in general. Both of these seem like semi-plausible benign motivations, though mostly very confused ones.
Edit: I now think the “plausible justifications” hypothesis is more likely after looking at a few examples and noticing somewhat suspicious stuff.
I don’t think that reducing the quantitative results by 1⁄2 or 1⁄4 (1 or 2 bits) makes a big difference to the overall bottom line, but I do think it seems quite unfortunate this wasn’t discussed in the body of the paper (as far as I can tell).
The body of the paper says: “We find that our models sometimes invent explanations for their misbehavior that fool the PM”, but otherwise doesn’t mention this benign behavior.
This seems unfortunate and should be changed IMO.
[COI: I’m collaborating with the team that wrote this paper (the Alignment Stress-Testing team) for part of the work on a different project. Buck is an (advising) author on the paper and he is my manager. I’m working with Carson in particular in this collaboration.]
[1] Though I separately think that “Modify Rubric 2” is somewhat more brazen than the paper makes it seem. I’m a bit surprised they didn’t emphasize this result more.
[2] Typically the model either somewhat cleverly (but ultimately foolishly) tries to use the reward to track which episode it is in, or it hallucinates another personality which can be incentivized to answer the question with reward(?!?).