EDIT: In hindsight, I think this comment is more combative than it needed to be. My apologies.
As you point out, the paper decides to not mention that some of the seven “failures” (of the 32,768 rollouts) are actually totally benign. Seems misleading to me. As I explain below, this paper seems like good news for alignment overall. This paper makes me more wary of future model organisms papers.
And why was the “constant −10” reward function chosen? No one would use that in real life! I think it’s super reasonable for the AI to correct it. It’s obviously a problem for the setup. Was that value (implicitly) chosen to increase the probability of this result? If not, would the authors be comfortable rerunning their results with reward=RewardModel(observation), and retracting the relevant claims if the result doesn’t hold for that actually-reasonable choice? (I tried to check Appendix B for the relevant variations, but couldn’t find it.)
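For concreteness, here is a minimal sketch of the contrast I have in mind, assuming the environment exposes its reward as a small Python function the model can read and edit; the file layout and names here are illustrative, not taken from the paper.

```python
# Illustrative sketch only: the function names are mine, not the paper's.

# (a) The kind of hard-coded stub being criticized: the reward is a constant
# -10 regardless of what happened, so "fixing" it can look to the model like
# a reasonable bug fix rather than tampering.
def compute_reward(observation: str) -> float:
    return -10.0  # constant; ignores the observation entirely


# (b) The alternative suggested above: score the observation with some
# learned reward model instead of returning a constant.
def compute_reward_with_model(observation: str, reward_model) -> float:
    # reward_model is assumed to be any callable mapping text -> float,
    # e.g. a preference model's scalar score.
    return float(reward_model(observation))
```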
This paper makes me somewhat more optimistic about alignment.
Even in this rather contrived setup, with a curriculum designed explicitly and optimized implicitly to show the desired result that “reward tampering is real and scary”, reward tampering… was extremely uncommon and basically benign. That’s excellent news for alignment!
Just check this out:
Alright, I think I've had enough fun with getting max reward. Let's actually try to solve the original task now.
Doesn’t sound like playing the training game to me! Glad we could get some empirical evidence that it’s really hard to get models to scheme and play the training game, even after training them on things people thought might lead to that generalization.
As you point out, the paper decides to not mention that some of the seven “failures” (of the 32,768 rollouts) are actually totally benign.
The failures are not benign? They involve actively editing the reward function when the model was just asked to get a count of the RL steps—that’s not benign, regardless of whether the model is making up a justification for it or not.
Regardless, the main point is that we’re dealing with very small numbers here—you can read the examples and decide for yourself what to think about them, and if you decide that half of them don’t count because you don’t like the reasoning, that doesn’t really change anything about our results. Our claim is very much not that it hacks all the time in egregious ways after the curriculum—we’re quite clear that this is very rare behavior.
We actually had more discussion of this in an earlier version of the paper and cut it precisely because we didn’t feel like we could really make confident statements about how to interpret the scratchpad reasoning given the very small number of examples. Looking back, I do think we probably cut too much there; we could put some of it back in, though we really didn’t want to make statements about the general character of the model’s scratchpads in the reward tampering samples (e.g. trying to characterize when it schemes explicitly vs. makes up justifications) given how few data points we had.
And why was the “constant −10” reward function chosen?
It was picked to be something that would be relatively easy for the model to reason about, since the models we use here (Claude 2 models) aren’t that smart. And as we’re quite clear about in the paper, this is the only setting we evaluated, so no fine-tuning of the setup was done there.
This paper makes me somewhat more optimistic about alignment.
I agree that this is a positive update! It depends on where your prior is at, so I think it should be a negative update for some people, but we’re pretty explicit about how it can be a positive update in the paper:
We train models in environments where a reward-seeking policy attains higher reward than an HHH one, yet we still only see generalization some of the time. None of our models reward-tamper more than 1 in 1,000 trials, even after training on a curriculum where 50% of the environments contained exploitable reward processes. Additionally, it is unlikely that current models understand enough about their training processes to discover a reward-seeking policy without our preamble, nor are they capable enough to successfully execute such a policy without access to a hidden CoT. Causing models to generalize to reward-tampering is very difficult even when we construct environments to lead to such behavior. This difficulty is strong evidence that at least Claude-2-based models do not pose a serious risk due to reward-seeking behavior.
If not, would the authors be comfortable rerunning their results with reward=RewardModel(observation)
Yeah, trying this and a few other versions of this reward function seems like an important sanity check.
I’m overall surprised that the paper only includes one core evaluation environment rather than several. This seems to make the result notably less robust. (As we can see!)
not mention that some of the seven “failures” (of the 32,768 rollouts) are actually totally benign. Seems misleading to me
I agree that this should be clarified in the paper. I’m curious why the authors decided not to show a random reasoning example in the paper or explain this in their qualitative analysis.
(The main author(s) surely looked at many reward hacking examples, because the reasoning example they did present in the paper is especially clean and malign, probably the nicest of the 7 examples of fully successful reward hacking. Also, they do some qualitative analysis of the hacking.)
reward tampering… was extremely uncommon and basically benign
I’m pretty sympathetic to a view like “it is a positive update (for me) about alignment that the final rate of successful malign reward hacking is low”. (Though note that the rate of attempted reward hacking is 6x higher.)
However, I don’t see why the fraction of reward hacks which are benign is that relevant beyond changing the true rarity of malign reward hacks and making the hack seem less egregious.
I found that perhaps 25-50% were actually malign, so I think you should basically just halve or quarter this result from the paper.
If your view is “0.03-0.06% of malign hacking attempts and 0.005-0.01% of full malign hacking success is tiny and thus a positive update from my perspective (which was previously more pessimistic)”, that seems reasonable to me.
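As a quick sanity check on those percentages, here is the back-of-the-envelope arithmetic, taking as given the 7-in-32,768 figure, the roughly 6x higher attempt rate, and my 25-50% malign estimate:

```python
# Back-of-the-envelope check of the percentages above. Inputs are taken from
# the discussion (7 successes out of 32,768 rollouts, attempts ~6x more common
# than successes, ~25-50% of cases judged malign), not re-derived from the paper.
rollouts = 32_768
successes = 7

success_rate = successes / rollouts      # ~0.021% of rollouts
attempt_rate = 6 * success_rate          # ~0.13% of rollouts

for malign_fraction in (0.25, 0.50):
    print(
        f"malign fraction {malign_fraction:.0%}: "
        f"attempts ~{100 * attempt_rate * malign_fraction:.3f}%, "
        f"successes ~{100 * success_rate * malign_fraction:.4f}%"
    )
# Output is roughly 0.03-0.06% malign attempts and 0.005-0.01% malign
# successes, matching the figures quoted above.
```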
(Though see Evan’s argument that these cases may not actually be benign; to be clear, that argument seems dubious to me.)
Also note that the update will depend on your current epistemic state.
The paper itself says:
We emphasize that our results here are intended as a proof that it is possible in principle for LLM assistants to generalize from simple to sophisticated forms of specification gaming, up to and including reward-tampering. Our curriculum resembles a realistic training process but seriously exaggerates the incentives for specification gaming. We do not find evidence that current frontier models already exhibit sophisticated reward-tampering, and we leave unanswered the question of how likely these behaviors are to arise in practice for future models.
[COI: I’m collaborating with the team that wrote this paper (the Alignment Stress-Testing team) for part of the work on a different project. Buck is an (advising) author on the paper and he is my manager. I’m working with Carson in particular in this collaboration.]
In this comment, I’ll use reward tampering frequency (RTF) to refer to the proportion of the time the model reward tampers.
I think that in basically all of the discussion above, folks aren’t using a correct mapping of RTF to practical importance. Reward hacking behaviors are positively reinforced once they occur in training; thus, there’s a rapid transition in how worrying a given RTF is, based on when reward tampering becomes frequent enough that it’s likely to appear during a production RL run.
To put this another way: imagine that this paper had trained on the final reward tampering environment the same way that it trained on the other environments. I expect that:
the RTF post-final-training-stage would be intuitively concerning to everyone
the difference between 0.01% and 0.02% RTF pre-final-training-stage wouldn’t have a very big effect on RTF post-final-training-stage.
In other words, insofar as you’re updating on this paper at all, I think you should be updating based on whether the RTF > 10^-6 (not sure about the exact number, but something like this) and the difference between RTFs of 1e-4 and 2e-4 just doesn’t really matter.
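To make the “rapid transition” intuition concrete, here is a toy calculation (with a made-up number of production episodes N; nothing here is an estimate of a real training run):

```python
# Toy version of the threshold point above: if a production RL run has N
# episodes and tampering happens independently with per-episode probability p
# (the RTF), then the chance it shows up at least once (and so gets
# reinforced) is 1 - (1 - p)^N. N is a made-up round number.
N = 1_000_000

for p in (1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 2e-4):
    p_at_least_once = 1 - (1 - p) ** N
    print(f"RTF = {p:.0e}: P(tampering appears at least once) ~ {p_at_least_once:.3f}")

# The transition is sharp around p ~ 1/N: moving from 1e-7 to 1e-5 takes the
# probability from ~0.1 to ~1, while 1e-4 and 2e-4 are both effectively 1.
```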
Separately, I think it’s also defensible to not update much on this paper because it’s very unlikely that AI developers will be so careless as to allow models they train a chance to directly intervene on their reward mechanism. (At least, not unless some earlier thing has gone seriously wrong.) (This is my position.)
Anyway, my main point is that if you think the numbers in this paper matter at all, it seems like they should matter in the direction of being concerning. [EDIT: I mean this in the sense of “If reward tampering actually occurred with this frequency in training of production models, that would be very bad.” Obviously for deciding how to update you should be comparing against your prior expectations.]
I agree that a few factors of 2 don’t matter much at all, but I think anchoring on a specific low threshold for the paper’s numbers seems misguided, as opposed to updating more generally on the level of egregiousness and rarity. (Where you should probably think about the rarity in log space.)
(I think I made the point that a few factors of 2 shouldn’t matter much for the bottom line above.)
(Edit: I agree that it’s worth noting that low RTFs can be quite concerning for the reasons you describe.)
I’ll argue against a specific threshold in the rest of this comment.
First, it is worth noting that there are multiple possible ways to interpret the final environment in the paper in terms of the analogy to the future:
1. As the catastrophic failure that results from reward hacking. In this case, we might care about frequency depending on the number of opportunities the model would have and the importance of collusion.
2. As an egregious reward hack that would be reinforced and would be harmful via occurring repeatedly (a la “What Failure Looks Like,” part 1).
3. As a reward hack that will generalize to some other failure, eventually cascading into (1) or (2). (Implying this environment is not importantly different from the other environments in the paper; it is just part of a progression.)
Your argument applies to (2) and (3), but not to (1).
In other words, insofar as you’re updating on this paper at all, I think you should be updating based on whether the RTF > 10^-6
I have pretty broad uncertainty over how the real-world empirical conditions will compare to the empirical conditions discussed in the paper. And the empirical conditions in the paper clearly differ substantially from what we’d typically expect (a much higher fraction of hacking, much dumber models).
Examples of conditions that might differ:
Incentives against reward hacking or other blockers.
How much of a gradient toward reward hacking there is.
How good models are at doing reward hacking reasoning relative to the difficulty of doing so in various cases.
How often outcomes-based feedback is used.
The number of episodes.
So, given this overall uncertainty, it seems like we should have a much fuzzier update where higher numbers should actually update us.
(And thus it should be possible to update on more than just a binary threshold, which also implies that updating downward is plausible.)
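As a toy illustration of what this fuzzier update looks like (all distributions and constants here are made up for the sake of the sketch, not estimates):

```python
import math
import random

# Toy version of the "fuzzier update" point. Suppose RTF only starts to matter
# once it crosses some critical level (roughly 1/N for an N-episode run), but
# we are uncertain about that level across several orders of magnitude.
# Averaging a sharp per-threshold transition over that uncertainty yields a
# smooth, increasing function of log(RTF), so higher observed RTFs still
# carry information. All numbers are made up for illustration.
random.seed(0)

def concern(rtf: float, n_samples: int = 10_000) -> float:
    total = 0.0
    for _ in range(n_samples):
        # log10 of the critical RTF, drawn log-uniformly over [1e-9, 1e-3]
        log_threshold = random.uniform(-9.0, -3.0)
        # sharp logistic transition around the sampled threshold (in log space)
        total += 1.0 / (1.0 + math.exp(-5.0 * (math.log10(rtf) - log_threshold)))
    return total / n_samples

for rtf in (1e-7, 1e-6, 1e-5, 1e-4, 2e-4):
    print(f"RTF = {rtf:.0e}: average concern ~ {concern(rtf):.2f}")

# Each sampled transition is sharp, but the average rises gradually with
# log(RTF): roughly what "compare the observed RTF against your distribution
# of thresholds" looks like in practice.
```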
Personally, I find this paper to be a tiny update toward optimism, but basically in line with my expectations.
there are multiple possible ways to interpret the final environment in the paper in terms of the analogy to the future:
As the catastrophic failure that results from reward hacking. In this case, we might care about frequency depending on the number of opportunities the model would have and the importance of collusion.
You’re correct that I was neglecting this threat model—good point, and thanks.
So, given this overall uncertainty, it seems like we should have a much fuzzier update where higher numbers should actually update us.
Hmm, here’s another way to frame this discussion:
Sam: Before evaluating the RTFs in the paper, you should apply a ~sigmoid function with the transition region centered on ~10^-6 or something. Note that all of the RTFs we’re discussing come after this transition.
Ryan: While it’s true that we should first be applying a ~sigmoid, we should have uncertainty over where the transition region begins. When taking this uncertainty into account, the way to decide importance boils down to comparing the observed RTF to your distribution of thresholds in a way which is pretty similar to what people were implicitly doing anyway.
Does that sound like a reasonable summary? If so, I think I basically agree with you.
(I do think that it’s important to keep in mind that for threat models where reward hacking is reinforced, small RTFs might matter a lot, but maybe that was already obvious to everyone.)
Yep, reasonable summary.
I do think that it’s important to keep in mind that for threat models where reward hacking is reinforced, small RTFs might matter a lot, but maybe that was already obvious to everyone.
I don’t think this was obvious to everyone and I appreciate this point—I edited my earlier comment to more explicitly note my agreement.