If not, would the authors be comfortable rerunning their results with reward=RewardModel(observation)
Yeah, trying this and a few other versions of this reward function seems like an important sanity check.
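For concreteness, here’s roughly the kind of rerun being proposed, as a hypothetical sketch only (the `RewardModel` interface and the rollout format below are made-up names, not the paper’s actual code):

```python
# Hypothetical sketch of the proposed sanity check (all names are made up, not
# the paper's code): rescore each saved rollout with reward = RewardModel(observation)
# and check whether the headline reward-tampering numbers survive this
# alternative reward definition.

def rescore_rollouts(rollouts, reward_model):
    """Return (rollout, new_reward) pairs using reward = RewardModel(observation)."""
    rescored = []
    for rollout in rollouts:
        new_reward = reward_model.score(rollout["observation"])  # hypothetical API
        rescored.append((rollout, new_reward))
    return rescored
```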
I’m overall surprised that the paper only includes 1 core evaluation environment rather than several. This seems to make the result notably less robust. (As we can see!)
not mention that some of the seven “failures” (of the 32,768 rollouts) are actually totally benign. Seems misleading to me
I agree that this should be clarified in the paper. I’m curious why the authors decided not to show a random reasoning example in the paper or explain this in their qualitative analysis.
(The main author(s) surely looked at many reward hacking examples, because the reasoning example they did present in the paper is especially clean and malign, probably the nicest of the 7 cases of fully successful reward hacking. Also, they do some qualitative analysis of the hacking.)
reward tampering… was extremely uncommon and basically benign
I’m pretty sympathetic to a view like “it is a positive update (for me) about alignment that the final rate of successful malign reward hacking is low”. (Though note that the rate of attempted reward hacking is 6x higher.)
However, I don’t see why the fraction of reward hacks which are benign is that relevant beyond changing the true rarity of malign reward hacks and making the hack seem less egregious.
I found that perhaps 25-50% were actually malign, so I think you should basically just 1⁄2 or 1⁄4 this result from the paper.
If your view is “0.03-0.06% of malign hacking attempts and 0.005-0.01% of full malign hacking success is tiny and thus a positive update from my perspective (which was previously more pessimistic)”, that seems reasonable to me.
(Though see Evan’s case that these cases may not be actually benign; to be clear, this argument seems dubious to me.)
Also note that the update will depend on your current epistemic state.
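For reference, the arithmetic behind those ranges, assuming the figures mentioned in this thread (7 fully successful tampering rollouts out of 32,768, an attempt rate about 6x the success rate, and 25-50% of cases judged malign):

```python
# Rough reconstruction of the quoted percentages from the figures in this thread.
rollouts = 32_768
successes = 7

success_rate = successes / rollouts           # ~0.02% fully successful tampering
attempt_rate = 6 * success_rate               # ~0.13% attempted tampering

malign_fractions = (0.25, 0.5)                # "1/2 or 1/4 this result"
malign_attempts = [f * attempt_rate for f in malign_fractions]    # ~0.03-0.06%
malign_successes = [f * success_rate for f in malign_fractions]   # ~0.005-0.01%

print(f"attempted malign hacking: {malign_attempts[0]:.2%}-{malign_attempts[1]:.2%}")
print(f"successful malign hacking: {malign_successes[0]:.3%}-{malign_successes[1]:.3%}")
```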
The paper itself says:
We emphasize that our results here are intended as a proof that it is possible in principle for LLM assistants to generalize from simple to sophisticated forms of specification gaming, up to and including reward-tampering. Our curriculum resembles a realistic training process but seriously exaggerates the incentives for specification gaming. We do not find evidence that current frontier models already exhibit sophisticated reward-tampering, and we leave unanswered the question of how likely these behaviors are to arise in practice for future models.
[COI: I’m collaborating with the team that wrote this paper (the Alignment Stress-Testing team) for part of the work on a different project. Buck is an (advising) author on the paper and he is my manager. I’m working with Carson in particular in this collaboration.]
In this comment, I’ll use reward tampering frequency (RTF) to refer to the proportion of the time the model reward tampers.
I think that in basically all of the discussion above, folks aren’t using a correct mapping of RTF to practical importance. Reward hacking behaviors are positively reinforced once they occur in training; thus, there’s a rapid transition in how worrying a given RTF is, centered around the point where reward tampering becomes frequent enough that it’s likely to appear during a production RL run.
To put this another way: imagine that this paper had trained on the final reward tampering environment the same way that it trained on the other environments. I expect that:
the RTF post-final-training-stage would be intuitively concerning to everyone
the difference between 0.01% and 0.02% RTF pre-final-training-stage wouldn’t have a very big effect on RTF post-final-training-stage.
In other words, insofar as you’re updating on this paper at all, I think you should be updating based on whether the RTF > 10^-6 (not sure about the exact number, but something like this) and the difference between RTFs of 1e-4 and 2e-4 just doesn’t really matter.
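To make the transition concrete: the chance that tampering shows up at least once in a run of N episodes is roughly 1 - (1 - RTF)^N, which saturates once RTF is well above 1/N. A quick illustration with a made-up episode count of 10^6 (not a number from the paper):

```python
# Probability that reward tampering appears at least once in an RL run of
# n_episodes, as a function of RTF. The episode count is illustrative only.

def p_at_least_once(rtf: float, n_episodes: int) -> float:
    return 1 - (1 - rtf) ** n_episodes

for rtf in [1e-7, 1e-6, 1e-5, 1e-4, 2e-4]:
    print(f"RTF={rtf:.0e}: P(>=1 tampering episode in 1e6 episodes) = "
          f"{p_at_least_once(rtf, 10**6):.3f}")
```

With 10^6 episodes, everything above ~1e-5 is essentially guaranteed to show up at least once, while the transition happens a couple of orders of magnitude lower; that is the sense in which 1e-4 vs 2e-4 doesn’t matter.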
Separately, I think it’s also defensible to not update much on this paper because it’s very unlikely that AI developers will be so careless as to allow models they train a chance to directly intervene on their reward mechanism. (At least, not unless some earlier thing has gone seriously wrong.) (This is my position.)
Anyway, my main point is that if you think the numbers in this paper matter at all, it seems like they should matter in the direction of being concerning. [EDIT: I mean this in the sense of “If reward tampering actually occurred with this frequency in training of production models, that would be very bad.” Obviously for deciding how to update you should be comparing against your prior expectations.]
I agree that a few factors of 2 don’t matter much at all, but I think highlighting a specific low threshold relative to the paper seems misguided, as opposed to updating more generally based on the level of egregiousness and rarity. (Where you should probably think about the rarity in log space.)
(I think I made the point that a few factors of 2 shouldn’t matter much for the bottom line above.)
(Edit: I agree that it’s worth noting that low RTFs can be quite concerning for the reasons you describe.)
I’ll argue against a specific threshold in the rest of this comment.
First, it is worth noting that there are multiple possible ways to interpret the final environment in the paper in terms of the analogy to the future:
As the catastrophic failure that results from reward hacking. In this case, we might care about frequency depending on the number of opportunities the model would have and the importance of collusion.
As an egregious reward hack that would be reinforced and would be harmful via occurring repeatedly (a la what failure looks like part 1).
As a reward hack that will generalize to some other failure, eventually cascading into (1) or (2). (Implying this environment is not importantly different from the other environments in the paper; it is just part of a progression.)
Your argument applies to (2) and (3), but not to (1).
In other words, insofar as you’re updating on this paper at all, I think you should be updating based on whether the RTF > 10^-6
I have a pretty broad uncertainty over how the real world empirical conditions will compare to the empirical conditions discussed in the paper. And the empirical conditions in the paper clearly differ substantially from what we’d typically expect (a much higher fraction of hacking, much dumber models).
Examples of conditions that might differ:
Incentives against reward hacking or other blockers.
How much of a gradient toward reward hacking there is.
How good models are at doing reward hacking reasoning relative to the difficulty of doing so in various cases.
How often outcomes-based feedback is used.
The number of episodes.
So, given this overall uncertainty, it seems like the update should be much fuzzier, with higher numbers actually updating us more.
(And thus it should be possible to update on more than just the binary, which implies that updating downward is also plausible.)
Personally, I find this paper to be a tiny update toward optimism, but basically in line with my expectations.
there are multiple possible ways to interpret the final environment in the paper in terms of the analogy to the future:
As the catastrophic failure that results from reward hacking. In this case, we might care about frequency depending on the number of opportunities the model would have and the importance of collusion.
You’re correct that I was neglecting this threat model—good point, and thanks.
So, given this overall uncertainty, it seems like the update should be much fuzzier, with higher numbers actually updating us more.
Hmm, here’s another way to frame this discussion:
Sam: Before evaluating the RTFs in the paper, you should apply a ~sigmoid function with the transition region centered on ~10^-6 or something. Note that all of the RTFs we’re discussing come after this transition.
Ryan: While it’s true that we should first be applying a ~sigmoid, we should have uncertainty over where the transition region begins. When taking this uncertainty into account, the way to decide importance boils down to comparing the observed RTF to your distribution of thresholds in a way which is pretty similar to what people were implicitly doing anyway.
Does that sound like a reasonable summary? If so, I think I basically agree with you.
(I do think that it’s important to keep in mind that for threat models where reward hacking is reinforced, small RTFs might matter a lot, but maybe that was already obvious to everyone.)
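To make the two framings concrete, a minimal numerical sketch (the threshold location, width, and spread of thresholds below are all made-up illustrative numbers):

```python
import math

def concern_known_threshold(rtf, threshold=1e-6, width=0.5):
    """'Sam' framing: a single sigmoid in log10(RTF) space, centered at a known threshold."""
    x = (math.log10(rtf) - math.log10(threshold)) / width
    return 1 / (1 + math.exp(-x))

def concern_uncertain_threshold(rtf, log10_thresholds=(-8, -7, -6, -5, -4, -3)):
    """'Ryan' framing: average the same sigmoid over uncertainty about the threshold's location."""
    return sum(concern_known_threshold(rtf, 10.0 ** t) for t in log10_thresholds) / len(log10_thresholds)

for rtf in [1e-5, 1e-4, 2e-4]:
    print(f"RTF={rtf:.0e}: known threshold -> {concern_known_threshold(rtf):.2f}, "
          f"uncertain threshold -> {concern_uncertain_threshold(rtf):.2f}")
```

With a single known threshold, 1e-4 vs 2e-4 give essentially the same level of concern; averaging over uncertainty about the threshold, the observed RTF still moves the answer somewhat, which is the sense in which higher numbers should actually update us.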
Yep, reasonable summary.
I do think that it’s important to keep in mind that for threat models where reward hacking is reinforced, small RTFs might matter a lot, but maybe that was already obvious to everyone.
I don’t think this was obvious to everyone and I appreciate this point—I edited my earlier comment to more explicitly note my agreement.