This is behavior that is specifically coming from training on the prior tasks.
It's worth noting that just because X causes Y, and X is related to reward hacking, doesn't imply that Y is reward hacking.
In this case, there might be a broader notion of “doing something related to the reward”.
(Not claiming you disagree here, but worth emphasizing.)
We think this preference model reward causes the model to come up with plausible justifications for its reward tampering
I’m currently skeptical given that:
The model didn’t do malign reasoning in the secret scratchpad it has in examples with actually plausible justifications.
It’s reasoning/justification for its behavior is typically semi-reasonable if you actually had benign motivations. The most common reasons/justifications are editing the reward to try to figure out the answer[1] and editing the reward to ensure that proper learning happens in general. Both of these seem semi-plausible-ish benign motivations though mostly very confused.
Edit: I now think the “plausible justifications” hypothesis is more likely after looking at a few examples and noticing somewhat suspicious stuff.
One of the two prompt schemas we tested tended to cause the model to make up plausible-sounding justifications, and the other tended to produce more explicit scheming reasoning.
What was the difference between the prompts?
The bottom-line result, though, is that the runs where we saw more scheming reasoning and the runs where we saw more justification-producing had very similar rates of reward-tampering, all much higher than those of the vanilla helpful-only model or of other earlier checkpoints in the curriculum, which never reward tamper at all.
I feel like this is pretty weak evidence given the level of noise in the results. That said, I think the fraction of reward-tampering examples which are actually benign is only about 1/2.
[COI: I’m collaborating with the team that wrote this paper (the Alignment Stress-Testing team) for part of the work on a different project. Buck is an (advising) author on the paper and he is my manager. I’m working with Carson in particular in this collaboration.]
[1] Typically, it either somewhat cleverly (but ultimately foolishly) tries to use the reward to track which episode it is in, or it hallucinates another personality which can be incentivized to answer the question with reward(?!?).