In this comment, I’ll use reward tampering frequency (RTF) to refer to the proportion of the time the model reward tampers.
I think that in basically all of the discussion above, folks aren’t using a correct mapping of RTF to practical importance. Reward hacking behaviors are positively reinforced once they occur in training; thus, there’s a rapid transition in how worrying a given RTF is, centered around the point where reward tampering becomes frequent enough that it’s likely to appear during a production RL run.
To put this another way: imagine that this paper had trained on the final reward tampering environment the same way that it trained on the other environments. I expect that:
- the RTF post-final-training-stage would be intuitively concerning to everyone
- the difference between 0.01% and 0.02% RTF pre-final-training-stage wouldn’t have a very big effect on RTF post-final-training-stage.
In other words, insofar as you’re updating on this paper at all, I think you should be updating based on whether the RTF > 10^-6 (not sure about the exact number, but something like this) and the difference between RTFs of 1e-4 and 2e-4 just doesn’t really matter.
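To make the “rapid transition” point concrete, here’s a minimal sketch of the chance that tampering shows up at least once during a run. The episode count is an assumption I’m inventing for illustration; none of these numbers come from the paper.

```python
# Minimal sketch: probability that reward tampering occurs at least once
# in an RL run, as a function of the per-episode tampering rate (RTF).
# The episode count is an assumed, illustrative number.

n_episodes = 100_000  # assumed size of a production RL run

for rtf in [1e-7, 1e-6, 1e-4, 2e-4]:
    p_at_least_once = 1 - (1 - rtf) ** n_episodes
    print(f"RTF = {rtf:.0e}: P(tampers at least once) = {p_at_least_once:.3f}")

# Under these assumptions, RTFs of 1e-4 and 2e-4 both make tampering all but
# certain to occur (and thus to be reinforced), while the real transition
# happens around RTF ~ 1 / n_episodes.
```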
Separately, I think it’s also defensible to not update much on this paper because it’s very unlikely that AI developers will be so careless as to allow models they train a chance to directly intervene on their reward mechanism. (At least, not unless some earlier thing has gone seriously wrong.) (This is my position.)
Anyway, my main point is that if you think the numbers in this paper matter at all, it seems like they should matter in the direction of being concerning. [EDIT: I mean this in the sense of “If reward tampering actually occurred with this frequency in training of production models, that would be very bad.” Obviously for deciding how to update you should be comparing against your prior expectations.]
I agree that a few factors of 2 don’t matter much at all, but I think highlighting a specific low threshold relative to the paper seems misguided as opposed to generally updating based on the level of egregiousness and rarity. (Where you should probably think about the rarity in log space.)
(I think I made the point that a few factors of 2 shouldn’t matter much for the bottom line above.)
(Edit: I agree that it’s worth noting that low RTFs can be quite concerning for the reasons you describe.)
I’ll argue against a specific threshold in the rest of this comment.
First, it is worth noting that there are multiple possible ways to interpret the final environment in the paper in terms of the analogy to the future:
1. As the catastrophic failure that results from reward hacking. In this case, we might care about frequency depending on the number of opportunities the model would have and the importance of collusion.
2. As an egregious reward hack that would be reinforced and would be harmful via occurring repeatedly (a la What Failure Looks Like, part 1).
3. As a reward hack that will generalize to some other failure, eventually cascading into (1) or (2). (Implying this environment is not importantly different from the other environments in the paper; it is just part of a progression.)
Your argument applies to (2) and (3), but not to (1).
In other words, insofar as you’re updating on this paper at all, I think you should be updating based on whether the RTF > 10^-6
I have pretty broad uncertainty over how the real-world empirical conditions will compare to the empirical conditions discussed in the paper. And the empirical conditions in the paper clearly differ substantially from what we’d typically expect (a much higher fraction of hacking, much dumber models).
Examples of conditions that might differ:
- Incentives against reward hacking, or other blockers.
- How much of a gradient toward reward hacking there is.
- How good models are at reward hacking reasoning, relative to the difficulty of doing so in various cases.
- How often outcomes-based feedback is used.
- The number of episodes.
So, given this overall uncertainty, it seems like we should have a much fuzzier update, where higher numbers should actually update us more.
(And thus it should be possible to update on more than just the binary threshold, which implies that updating downward is also plausible.)
Personally, I find this paper to be a tiny update toward optimism, but basically in line with my expectations.
there are multiple possible ways to interpret the final environment in the paper in terms of the analogy to the future:
As the catastrophic failure that results from reward hacking. In this case, we might care about frequency depending on the number of opportunities the model would have and the importance of collusion.
You’re correct that I was neglecting this threat model—good point, and thanks.
So, given this overall uncertainty, it seems like we should have a much fuzzier update, where higher numbers should actually update us more.
Hmm, here’s another way to frame this discussion:
Sam: Before evaluating the RTFs in the paper, you should apply a ~sigmoid function with the transition region centered on ~10^-6 or something. Note that all of the RTFs we’re discussing come after this transition.
Ryan: While it’s true that we should first be applying a ~sigmoid, we should have uncertainty over where the transition region begins. When taking this uncertainty into account, the way to decide importance boils down to comparing the observed RTF to your distribution of thresholds in a way which is pretty similar to what people were implicitly doing anyway.
Does that sound like a reasonable summary? If so, I think I basically agree with you.
(I do think that it’s important to keep in mind that for threat models where reward hacking is reinforced, small RTFs might matter a lot, but maybe that was already obvious to everyone.)
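To make the two framings above concrete, here’s a minimal sketch. The ~10^-6 center comes from the discussion above, but the sigmoid width and the spread of the threshold distribution are assumptions I’m making purely for illustration.

```python
import numpy as np

def concern_sharp(rtf, threshold=1e-6, width=0.25):
    """Sam's framing: a ~sigmoid in log10(RTF), centered on a single threshold."""
    return 1 / (1 + np.exp(-(np.log10(rtf) - np.log10(threshold)) / width))

def concern_fuzzy(rtf, n_samples=100_000, seed=0):
    """Ryan's framing: uncertainty over where the threshold sits, so concern is
    the sharp sigmoid averaged over a distribution of thresholds (here, normal
    in log10-space around 1e-6 with a two-decade spread -- an assumed prior)."""
    rng = np.random.default_rng(seed)
    log10_thresholds = rng.normal(loc=-6.0, scale=2.0, size=n_samples)
    return np.mean(concern_sharp(rtf, threshold=10.0 ** log10_thresholds))

for rtf in [1e-7, 1e-6, 1e-4, 2e-4]:
    print(f"RTF = {rtf:.0e}: sharp = {concern_sharp(rtf):.2f}, "
          f"fuzzy = {concern_fuzzy(rtf):.2f}")

# Under the sharp framing, everything above ~1e-6 saturates to "concerning",
# so 1e-4 vs 2e-4 carries no extra information. Under the fuzzy framing the
# curve keeps rising with log10(RTF), so higher observed RTFs still update
# you somewhat -- which is roughly the disagreement above.
```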
Yep, reasonable summary.
I do think that it’s important to keep in mind that for threat models where reward hacking is reinforced, small RTFs might matter a lot, but maybe that was already obvious to everyone.
I don’t think this was obvious to everyone and I appreciate this point; I edited my earlier comment to more explicitly note my agreement.