Satron comments on AI Control: Improving Safety Despite Intentional Subversion

Satron 16 Nov 2024 13:04 UTC
3 points
2
What about the following red team strategy:

Give an honest assessment of the suspicion level, unless there is a very well hidden backdoor, then give a low score. Also, only create backdoors if it is possible to hide them well.

Wouldn’t this defeat the self-checking strategy?
- Fabien Roger 17 Nov 2024 13:47 UTC
  7 points
  3
  Parent
  That would work if the untrusted model is better at inserting subtle backdoors than the “human” high-quality labor used to generate the synthetic backdoors (used to validate that the untrusted model doesn’t just lie).
  In our work we make the assumption that the “human” labor and the untrusted model are as capable (in fact we use GPT-4 as a substitute for “human” labor), so if I understood your suggestion correctly this doesn’t quite work in the context of the paper, though this can be used to slightly improve the red-team untrusted monitor—if we did not already tell it to flag obvious backdoors (I don’t remember exactly what we did).
  It’s unclear how much this assumption will hold in practice for the untrusted AIs I care most about (the first AIs able to speed up alignment research by ≥20x). I am optimistic it will hold because humans will be able to spend a lot of time crafting backdoors, while untrusted AIs will only have a forward pass.