Hmm, I think I was wrong about DAgger and confused it with a somewhat different approach in my head.
I agree that it provides bounds (under various assumptions about the learning algorithm that we can't prove for NNs, but that seem reasonable in practice).
I now agree that the proposed method is basically just a slight tweak of DAgger to make it more sample/cost efficient in the case where our issue is discrimination by the policy.
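(For anyone else reading along: my rough picture of the vanilla DAgger loop, ignoring the expert/learner mixing schedule, is below. `env`, `expert`, and `fit` are placeholder interfaces for illustration, not any real API.)

```python
# Rough sketch of the vanilla DAgger loop (Ross et al., 2011), for reference only;
# env, expert, and fit are placeholder interfaces, not a real API.

def dagger(env, expert, fit, n_iters, horizon):
    dataset = []                 # aggregated (state, expert action) pairs
    policy = expert              # iteration 0 effectively rolls out the expert
    for _ in range(n_iters):
        state = env.reset()
        for _ in range(horizon):
            # Label every visited state with the expert's action...
            dataset.append((state, expert.act(state)))
            # ...but let the *current policy* choose the action that drives the
            # environment, so we gather data on states the learner actually visits.
            state = env.step(policy.act(state))
        policy = fit(dataset)    # retrain on the full aggregated dataset
    return policy
```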
> but this feels a bit like a reframing of an old problem.
I agree this is a special case of well-known issues with behavioral cloning; we probably should have made this clearer in the post.
I do think there are some interesting ideas to consider in the alignment setting. For example, the chunk size k is equivalent to the number of roll-out steps in imitation learning (IL). "Chunking" the roll-out to a fixed window is a common optimization when the task has a long time horizon and the expert is expensive to query. On the other hand, longer roll-outs provide stronger guarantees on how well the learned policy matches the expert.
Classically, this is a simple tradeoff between performance and speed. But, as you mention, k must also be kept intentionally small so that the AI cannot detect that it is being trained on human generations. How does one choose the chunk size to both get strong alignment guarantees and avoid discrimination? Dynamic roll-out strategies have been proposed in the IL literature, though I'm not very familiar with them.
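To make the chunking concrete, here is roughly the kind of collection step I have in mind. This is purely illustrative, reusing the placeholder interfaces from the sketch above; sampling k from a range is just one naive guess at what a "dynamic" roll-out strategy could look like, not a claim about what the IL literature actually does.

```python
import random

# Illustrative only: how the chunk size k enters a DAgger-style collection step,
# using the same placeholder env/expert/policy interfaces as the sketch above.
# Sampling k per chunk is one naive guess at a "dynamic" roll-out strategy.

def collect_chunk(env, expert, policy, k_min, k_max):
    state = env.reset()                  # or resume from some existing prefix
    k = random.randint(k_min, k_max)     # vs. a fixed k the policy could learn to spot
    chunk = []
    for _ in range(k):
        chunk.append((state, expert.act(state)))   # expert labels: cost grows with k
        state = env.step(policy.act(state))        # the learner drives the roll-out
    return chunk
```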
Just to make another note: "Solving the problem in theory" is also equivalent to the [forward training algorithm](https://www.cs.cmu.edu/~sross1/publications/Ross-AIStats10-paper.pdf), which preceded DAgger and is by the same authors.
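My rough, from-memory picture of forward training, again with placeholder interfaces: it trains a separate policy for each timestep, in order, so the policy for step t is trained on the state distribution induced by the already-trained policies for the earlier steps.

```python
# Rough sketch of forward training (Ross & Bagnell, 2010), from memory: one policy
# per timestep, trained in order, with the same placeholder interfaces as above.

def forward_training(env, expert, fit, T, n_rollouts):
    policies = []                              # policies[t] acts at timestep t
    for t in range(T):
        data_t = []
        for _ in range(n_rollouts):
            state = env.reset()
            # Steps 0..t-1: follow the already-trained per-step policies, so step t
            # sees the state distribution those policies actually induce.
            for s in range(t):
                state = env.step(policies[s].act(state))
            data_t.append((state, expert.act(state)))   # query the expert at step t
        policies.append(fit(data_t))           # train the timestep-t policy
    return policies
```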
(I’ll edit the post at some point to highlight this discussion and clarify this.)