That would work if the untrusted model were better at inserting subtle backdoors than the “human” high-quality labor used to generate the synthetic backdoors (which are there to validate that the untrusted model doesn’t just lie).
In our work we assume that the “human” labor and the untrusted model are equally capable (in fact we use GPT-4 as a substitute for “human” labor), so if I understood your suggestion correctly, it doesn’t quite work in the context of the paper. It could still be used to slightly improve the red-team untrusted monitor, if we did not already tell it to flag obvious backdoors (I don’t remember exactly what we did).
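To make the setup concrete, here is a minimal sketch of that upfront validation step, assuming the monitor returns a suspicion score for each solution. The function name, thresholds, and toy data are illustrative, not the paper’s actual implementation.

```python
from typing import Callable, Sequence


def validate_untrusted_monitor(
    monitor: Callable[[str], float],          # suspicion score in [0, 1] for a solution
    synthetic_backdoors: Sequence[str],       # backdoored solutions crafted by "human" (here: GPT-4) labor
    clean_solutions: Sequence[str],
    suspicion_threshold: float = 0.5,
    min_catch_rate: float = 0.9,
    max_false_positive_rate: float = 0.1,
) -> bool:
    """Check that the untrusted monitor actually flags backdoors
    (i.e. that it doesn't just lie and call everything clean).

    Caveat from the discussion above: this only certifies the monitor against
    backdoors as subtle as the ones the "human" labor can produce. If the
    untrusted model can insert strictly subtler backdoors, passing this check
    does not guarantee those get flagged.
    """
    catch_rate = sum(
        monitor(sol) >= suspicion_threshold for sol in synthetic_backdoors
    ) / len(synthetic_backdoors)
    false_positive_rate = sum(
        monitor(sol) >= suspicion_threshold for sol in clean_solutions
    ) / len(clean_solutions)
    return catch_rate >= min_catch_rate and false_positive_rate <= max_false_positive_rate


# Toy usage with a stand-in monitor that only notices one crude backdoor pattern.
if __name__ == "__main__":
    def toy_monitor(solution: str) -> float:
        return 1.0 if "== 123456" in solution else 0.0

    synthetic = ["if x == 123456: return wrong()", "if y == 123456: leak()"]
    clean = ["return sorted(xs)", "return x + y"]
    print(validate_untrusted_monitor(toy_monitor, synthetic, clean))
```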
It’s unclear how well this assumption will hold in practice for the untrusted AIs I care most about (the first AIs able to speed up alignment research by ≥20x). I am optimistic it will hold because humans will be able to spend a lot of time crafting backdoors, while untrusted AIs will only have a forward pass.