How does this differ from DAgger (https://arxiv.org/abs/1011.0686)?
Edit: I now think this is false for how DAgger is presented in the paper, see discussion below.
The method and the motivation are similar, though note that DAgger is effectively an RL scheme trying to maximize performance while we’re trying to avoid a particular failure mode due to misalignment.
From my understanding, DAgger just involves correcting errors that humans can recognize, while we’re trying to get stronger guarantees.
It’s not clear to me that you do get stronger guarantees because the setting and method is so similar to that of classical imitation learning. In both cases, we seek to learn a policy that is aligned with the expert (human). Supervised fine-tuning (behavioral cloning) is problematic because of distribution shift, i.e. the learned policy accumulates error (at a quadratic rate!) and visits states it did not see in training.
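(For reference, my paraphrase of the bounds as I recall them from Ross & Bagnell, not from the post: if the learned policy errs with probability at most ε per step under the expert’s state distribution, behavioral cloning only guarantees a quadratic-in-horizon bound, while interactive methods recover a linear one.)

```latex
% Behavioral cloning: errors compound over the horizon T
J(\hat{\pi}) \le J(\pi^*) + T^2 \epsilon
% Interactive methods (forward training, DAgger) train on the learner's
% own state distribution and recover linear dependence on T:
J(\hat{\pi}) \le J(\pi^*) + O(T\epsilon)
```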
You say this failure mode is dangerous because of scheming AI and I say it’s dangerous because the policy is OOD, but it appears you agree that the AI only “recognizes” it’s not in training because of distribution shift: “Halfway through the generation, the AI could detect those imitation mistakes...” To me, the differing justifications for why the AI performs poorly/dangerously are a matter of interpretation, not a fundamental difference.
I also don’t think it’s fair to describe DAgger as just “correcting errors that humans can recognize” because it actually provides formal bounds on error accumulation, which would appear to limit the failure mode you describe here. Admittedly, I’m very new to safety research as a whole, but this feels a bit like a reframing of an old problem.
(Also, to be clear, thanks for the comment. I strong upvoted it.)
Hmm, I think I was wrong about DAgger and confused it with a somewhat different approach in my head.
I agree that it provides bounds. (Under various assumptions about the learning algorithm that we can’t prove for NNs but seem reasonable to assume in practice.)
I now agree that the proposed method is basically just a slight tweak of DAgger to make it more sample/cost efficient in the case where our issue is discrimination by the policy.
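To make the comparison concrete, here is a minimal sketch of the DAgger loop as I understand it from the paper. The `env`/`expert`/`policy` interfaces are hypothetical stand-ins, not anyone’s actual code:

```python
import random

def dagger(policy, expert, env, n_iters=10, horizon=100, beta=0.5):
    """Sketch of the DAgger loop (Ross et al., 2011): roll out a mixture
    of the current policy and the expert, label every visited state with
    the expert's action, and retrain on the aggregated dataset."""
    dataset = []  # aggregated (state, expert_action) pairs across iterations
    for _ in range(n_iters):
        state = env.reset()
        for _ in range(horizon):
            # Always record the expert's label for the visited state.
            dataset.append((state, expert.act(state)))
            # With probability beta, let the expert drive; otherwise the learner.
            actor = expert if random.random() < beta else policy
            state, done = env.step(actor.act(state))
            if done:
                break
        policy.fit(dataset)  # supervised learning on all data collected so far
        beta *= 0.5          # anneal toward pure learner roll-outs
    return policy
```

The key point for this discussion is that labels are collected on states the *learner* visits, which is exactly what keeps the policy on-distribution.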
I agree this is a special case of well known issues with behavioral cloning—we probably should have made this more clear in the post.
Just to make another note, “Solving the problem in theory” is also equivalent to the [forward training algorithm](https://www.cs.cmu.edu/~sross1/publications/Ross-AIStats10-paper.pdf), which preceded DAgger by the same authors.
I do think there are some interesting ideas to consider in the alignment setting. For example, the chunk size k is equivalent to the number of roll-out steps in IL. “Chunking” the roll-out to a fixed window is a common optimization if the task has a long time horizon and the expert is expensive to query. On the other hand, longer roll-outs provide stronger guarantees on how well the learned policy matches the expert.
Classically, this is a simple tradeoff between performance and speed. But, as you mention, k must also be kept intentionally small so that the AI cannot detect it is being trained on human generations. How does one choose the chunk size to both preserve strong alignment guarantees and avoid discrimination? Dynamic roll-out strategies have been proposed in the IL literature, though I’m not very familiar with them.
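As a toy illustration of the chunking idea (again with hypothetical interfaces; a sketch of the general technique, not the post’s exact procedure): the learner drives a roll-out prefix so that the chunk starts on the learner’s own state distribution, and the expert is then queried for only the next k steps.

```python
def chunked_labels(policy, expert, env, k, prefix_len):
    """Collect expert labels for a window of k steps starting from a state
    distribution induced by the learner's own roll-out (a sketch with
    hypothetical interfaces)."""
    state = env.reset()
    # Let the current policy drive the first `prefix_len` steps so the
    # chunk starts on-distribution for the learner.
    for _ in range(prefix_len):
        state, done = env.step(policy.act(state))
        if done:
            return []
    # Query the (expensive) expert only for the next k steps: the "chunk".
    chunk = []
    for _ in range(k):
        action = expert.act(state)
        chunk.append((state, action))
        state, done = env.step(action)
        if done:
            break
    return chunk
```

Larger k means more expert queries per example (and, in the alignment setting, a longer human-generated window for the policy to discriminate on); smaller k weakens the guarantee on matching the expert over long horizons.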
(I’ll edit the post at some point to highlight this discussion and clarify this.)
I would say that it is dangerous in the case where it is both OOD enough that the AI can discriminate and the AI is scheming.
Neither alone would present a serious (i.e. catastrophic) risk in the imitation context we discuss.
[resolved]
Thanks, I improved the wording.
I have edited the post to add the relevant disclaimers and links to the papers that describe very similar techniques. Thank you very much for bringing these to my attention!