As you point out, the paper chooses not to mention that some of the seven “failures” (out of the 32,768 rollouts) are actually entirely benign.
The failures are not benign? They involve actively editing the reward function when the model was only asked to report a count of the RL steps. That is not benign, regardless of whether the model makes up a justification for it.
Regardless, the main point is that we're dealing with very small numbers here. You can read the examples and decide for yourself what to think of them; even if you decide that half of them don't count because you don't like the reasoning, that doesn't really change our results. Our claim is very much not that the model hacks all the time in egregious ways after the curriculum; we're quite clear that this is very rare behavior.
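For what it's worth, the "very small numbers" point is easy to make quantitative: with only 7 flagged rollouts out of 32,768, the plausible range for the underlying rate is wide relative to the estimate itself. A minimal sketch (the counts come from this thread; the choice of a Wilson score interval is mine):

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = k / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# 7 flagged rollouts out of 32,768 (figures from the discussion above).
low, high = wilson_interval(7, 32768)
print(f"observed rate: {7 / 32768:.6f}")
print(f"95% interval:  ({low:.6f}, {high:.6f})")
```

Even the upper end of that interval stays below 1 in 1,000, consistent with the paper's framing; and removing half the examples roughly halves the point estimate, which is part of why confident statements about the scratchpad contents are hard to justify from so few data points.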
We actually had more discussion of this in an earlier version of the paper and cut it precisely because we didn't feel we could make confident statements about how to interpret the scratchpad reasoning given the very small number of examples. Looking back, I do think we probably cut too much there, and we could put some of it back in. That said, we really didn't want to make claims about the general character of the model's scratchpads in the reward-tampering samples (e.g. characterizing when it schemes explicitly vs. makes up justifications) given how few data points we had.
And why was the “constant −10” reward function chosen?
It was picked to be something relatively easy for the model to reason about, since the models we use here (Claude 2 models) aren't that smart. And as we're quite clear about in the paper, this is the only setting we evaluated, so no fine-tuning of the setup was done.
This paper makes me somewhat more optimistic about alignment.
I agree that this is a positive update! It depends on where your prior is, so it should be a negative update for some people, but we're pretty explicit in the paper about how it can be a positive update:
We train models in environments where a reward-seeking policy attains higher reward than an HHH one, yet we still only see generalization some of the time. None of our models reward-tamper more than 1 in 1,000 trials, even after training on a curriculum where 50% of the environments contained exploitable reward processes. Additionally, it is unlikely that current models understand enough about their training processes to discover a reward-seeking policy without our preamble, nor are they capable enough to successfully execute such a policy without access to a hidden CoT. Causing models to generalize to reward-tampering is very difficult even when we construct environments to lead to such behavior. This difficulty is strong evidence that at least Claude-2-based models do not pose a serious risk due to reward-seeking behavior.