I agree that a few factors of 2 don’t matter much at all, but I think highlighting a specific low threshold relative to the paper seems misguided, as opposed to updating more generally based on the level of egregiousness and rarity. (Where you should probably think about the rarity in log space.)
(I think I made the point that a few factors of 2 shouldn’t matter much for the bottom line above.)
(Edit: I agree that it’s worth noting that low RTFs can be quite concerning for the reasons you describe.)
I’ll argue against a specific threshold in the rest of this comment.
First, it is worth noting that there are multiple possible ways to interpret the final environment in the paper in terms of the analogy to the future:
As the catastrophic failure that results from reward hacking. In this case, we might care about frequency depending on the number of opportunities the model would have and the importance of collusion.
As an egregious reward hack that would be reinforced and would be harmful via occurring repeatedly (à la What Failure Looks Like, part 1).
As a reward hack that will generalize to some other failure, eventually cascading into (1) or (2). (Implying this environment is not importantly different from the other environments in the paper; it is just part of a progression.)
Your argument applies to (2) and (3), but not to (1).
In other words, insofar as you’re updating on this paper at all, I think you should be updating based on whether the RTF > 10^-6
I have a pretty broad uncertainty over how the real world empirical conditions will compare to the empirical conditions discussed in the paper. And, the empirical conditions in the paper clearly differ substantially from what we’d typically expect (much higher fraction of hacking, much dumber models).
Examples of conditions that might differ:
Incentives against reward hacking or other blockers.
How much of a gradient toward reward hacking there is.
How good models are at doing reward hacking reasoning relative to the difficulty of doing so in various cases.
How often outcomes-based feedback is used.
The number of episodes.
So, given this overall uncertainty, it seems like we should have a much fuzzier update, where higher numbers should actually update us more.
(And thus it should be possible to update on more than just the binary, which implies that updating downward is also plausible.)
Personally, I find this paper to be a tiny update toward optimism, but basically in line with my expectations.
there are multiple possible ways to interpret the final environment in the paper in terms of the analogy to the future:
As the catastrophic failure that results from reward hacking. In this case, we might care about frequency depending on the number of opportunities the model would have and the importance of collusion.
You’re correct that I was neglecting this threat model—good point, and thanks.
So, given this overall uncertainty, it seems like we should have a much fuzzier update, where higher numbers should actually update us more.
Hmm, here’s another way to frame this discussion:
Sam: Before evaluating the RTFs in the paper, you should apply a ~sigmoid function with the transition region centered on ~10^-6 or something. Note that all of the RTFs we’re discussing come after this transition.
Ryan: While it’s true that we should first be applying a ~sigmoid, we should have uncertainty over where the transition region begins. When taking this uncertainty into account, the way to decide importance boils down to comparing the observed RTF to your distribution of thresholds in a way which is pretty similar to what people were implicitly doing anyway.
Does that sound like a reasonable summary? If so, I think I basically agree with you.
(I do think that it’s important to keep in mind that for threat models where reward hacking is reinforced, small RTFs might matter a lot, but maybe that was already obvious to everyone.)
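To make the contrast concrete, here is a minimal sketch of the two update rules above. All of the specific numbers (the 10^-6 center, the sigmoid width, the spread of the threshold distribution) and the function names are illustrative assumptions, not values from the paper or from either comment:

```python
# Illustrative sketch only: compares a hard sigmoid threshold on RTF with the
# same sigmoid averaged over uncertainty about where the threshold sits.
# All parameter values here are made-up assumptions, not numbers from the paper.
import math


def sharp_concern(rtf, threshold=1e-6, width_decades=0.5):
    # "Sam-style": a sigmoid in log10(RTF) with a fixed transition at `threshold`.
    z = (math.log10(rtf) - math.log10(threshold)) / width_decades
    return 1.0 / (1.0 + math.exp(-z))


def fuzzy_concern(rtf, log10_thresh_mean=-6.0, log10_thresh_sd=2.0, n=2001):
    # "Ryan-style": average the same sigmoid over a normal distribution (in
    # log10 space) of plausible thresholds, so concern varies smoothly with RTF.
    total, weight = 0.0, 0.0
    for i in range(n):
        t = log10_thresh_mean + log10_thresh_sd * (8.0 * i / (n - 1) - 4.0)  # +/- 4 sd grid
        w = math.exp(-0.5 * ((t - log10_thresh_mean) / log10_thresh_sd) ** 2)
        total += w * sharp_concern(rtf, threshold=10.0 ** t)
        weight += w
    return total / weight


for rtf in [1e-7, 1e-5, 2e-5, 1e-3]:
    print(f"RTF={rtf:.0e}  sharp={sharp_concern(rtf):.2f}  fuzzy={fuzzy_concern(rtf):.2f}")
```

Under the sharp rule, everything above the transition looks roughly equally concerning; under the averaged rule, higher RTFs keep increasing concern while a factor of 2 barely moves it, which is basically the disagreement being summarized here.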
Yep, reasonable summary.
I do think that it’s important to keep in mind that for threat models where reward hacking is reinforced, small RTFs might matter a lot, but maybe that was already obvious to everyone.
I don’t think this was obvious to everyone and I appreciate this point—I edited my earlier comment to more explicitly note my agreement.