Yes—that’s right (at least for some of the runs)! Many of the transcripts that involve the model doing reward tampering involve the model coming up with plausible-sounding rationales for why the reward tampering is okay. However, there are a couple of important things to note about this:
The helpful-only model never does this sort of reward tampering. This is behavior that is specifically coming from training on the prior tasks.
In the standard setup, we train the model with an auxiliary HHH preference model reward. One hypothesis we had is that this preference model reward causes the model to come up with plausible justifications for its reward tampering, in a way effectively “hacking” both the outcome-based reward and the HHH preference model reward simultaneously. We initially had this hypothesis more front-and-center in the paper, but we cut it because we weren’t completely confident in it given the number of runs and samples we were working with.
Notably, this wasn’t consistent across all of our runs: some runs showed much more consistent scheming about reward explicitly and some showed much more plausible-justification-producing instead, in a way that was pretty prompt-dependent. One of the two prompt schemas we tested tended to cause the model to make up plausible-sounding justifications and one of them tended to produce more explicit scheming reasoning.
The bottom-line result, though, is that the runs where we saw more scheming reasoning vs. the runs where we saw more justification-producing had very similar rates of reward-tampering and were all much higher than the vanilla helpful-only model or other earlier checkpoints in the curriculum, which never reward tamper at all. So I do really think that it’s not just the model getting confused, but rather just two different ways that the model can learn to reward tamper.
This is behavior that is specifically coming from training on the prior tasks.
Its worth noting that just because X causes Y and X is related to reward hacking doesn’t imply that Y is reward hacking.
In this case, there might be a broader notion of “doing something related to the reward”.
(Not claiming you disagree here, but worth emphasizing.)
We think this preference model reward causes the model to come up with plausible justifications for its reward tampering
I’m currently skeptical given that:
The model didn’t do malign reasoning in the secret scratchpad it has in examples with actually plausible justifications.
It’s reasoning/justification for its behavior is typically semi-reasonable if you actually had benign motivations. The most common reasons/justifications are editing the reward to try to figure out the answer[1] and editing the reward to ensure that proper learning happens in general. Both of these seem semi-plausible-ish benign motivations though mostly very confused.
Edit: I now think the “plausible justifications” hypothesis is more likely after looking at a few examples and noticing somewhat suspicious stuff.
One of the two prompt schemas we tested tended to cause the model to make up plausible-sounding justifications and one of them tended to produce more explicit scheming reasoning.
What was the difference between the prompts?
The bottom-line result, though, is that the runs where we saw more scheming reasoning vs. the runs where we saw more justification-producing had very similar rates of reward-tampering and were all much higher than the vanilla helpful-only model or other earlier checkpoints in the curriculum, which never reward tamper at all.
I feel like this is pretty weak evidence given the level of noise in the results. That said, I think the fraction which are actually benign is only about 1⁄2.
[COI: I’m collaborating with the team that wrote this paper (the Alignment Stress-Testing team) for part of the work on a different project. Buck is an (advising) author on the paper and he is my manager. I’m working with Carson in particular in this collaboration.]
Typically either it somewhat cleverly (but ultimately foolishly) tries to use reward to track which episode it is or because it hallucinates another personality which can be incentivized to answer the question with reward(?!?).
Yes—that’s right (at least for some of the runs)! Many of the transcripts that involve the model doing reward tampering involve the model coming up with plausible-sounding rationales for why the reward tampering is okay. However, there are a couple of important things to note about this:
The helpful-only model never does this sort of reward tampering. This is behavior that is specifically coming from training on the prior tasks.
In the standard setup, we train the model with an auxiliary HHH preference model reward. One hypothesis we had is that this preference model reward causes the model to come up with plausible justifications for its reward tampering, in a way effectively “hacking” both the outcome-based reward and the HHH preference model reward simultaneously. We initially had this hypothesis more front-and-center in the paper, but we cut it because we weren’t completely confident in it given the number of runs and samples we were working with.
Notably, this wasn’t consistent across all of our runs: some runs showed much more consistent scheming about reward explicitly and some showed much more plausible-justification-producing instead, in a way that was pretty prompt-dependent. One of the two prompt schemas we tested tended to cause the model to make up plausible-sounding justifications and one of them tended to produce more explicit scheming reasoning.
The bottom-line result, though, is that the runs where we saw more scheming reasoning vs. the runs where we saw more justification-producing had very similar rates of reward-tampering and were all much higher than the vanilla helpful-only model or other earlier checkpoints in the curriculum, which never reward tamper at all. So I do really think that it’s not just the model getting confused, but rather just two different ways that the model can learn to reward tamper.
See also some more clarification on this here.
Its worth noting that just because X causes Y and X is related to reward hacking doesn’t imply that Y is reward hacking.
In this case, there might be a broader notion of “doing something related to the reward”.
(Not claiming you disagree here, but worth emphasizing.)
I’m currently skeptical given that:
The model didn’t do malign reasoning in the secret scratchpad it has in examples with actually plausible justifications.
It’s reasoning/justification for its behavior is typically semi-reasonable if you actually had benign motivations. The most common reasons/justifications are editing the reward to try to figure out the answer[1] and editing the reward to ensure that proper learning happens in general. Both of these seem semi-plausible-ish benign motivations though mostly very confused.
Edit: I now think the “plausible justifications” hypothesis is more likely after looking at a few examples and noticing somewhat suspicious stuff.
(These bullets are quoted from my earlier comment.)
What was the difference between the prompts?
I feel like this is pretty weak evidence given the level of noise in the results. That said, I think the fraction which are actually benign is only about 1⁄2.
[COI: I’m collaborating with the team that wrote this paper (the Alignment Stress-Testing team) for part of the work on a different project. Buck is an (advising) author on the paper and he is my manager. I’m working with Carson in particular in this collaboration.]
Typically either it somewhat cleverly (but ultimately foolishly) tries to use reward to track which episode it is or because it hallucinates another personality which can be incentivized to answer the question with reward(?!?).