ryan_greenblatt comments on An issue with training schemers with supervised fine-tuning

ryan_greenblatt 28 Jun 2024 20:21 UTC
LW: 2 AF: 2
0
AF

this failure mode is dangerous because of scheming AI and I say it’s dangerous because the policy is OOD

I would say that it is dangerous in the case where is is both OOD enough that the AI can discriminate and the AI is scheming.

Neither alone would present a serious (i.e. catastrophic) risk in the imitation context we discuss.
- Vivek Hebbar 28 Jun 2024 20:55 UTC
  1 point
  0
  Parent
  [resolved]
  - ryan_greenblatt 28 Jun 2024 22:37 UTC
    2 points
    0
    Parent
    Thanks, I improved the wording.