Dan Braun comments on Untrusted smart models and trusted dumb models

Dan Braun 4 Nov 2023 10:57 UTC
LW: 8 AF: 4
Nice post.
Pushing back a little on this part of the appendix:
Also, unlike many other capabilities which we might want to evaluate, we don’t need to worry about the possibility that even though the model can’t do this unassisted, it can do it with improved scaffolding—the central deceptive alignment threat model requires the model to think its strategy through in a forward pass, and so if our model can’t be fine-tuned to answer these questions, we’re probably safe.
I’m a bit concerned about people assuming this is true for models going forward. A sufficiently intelligent RL-trained model can learn to distribute its planning across multiple forward passes. I think your claim is true for models trained purely on next-token prediction, and for GPT-4-level models, whose outputs are all human-understandable (and incorporated into the oversight process) even though their training includes an RL component.
But even 12 months from now, I’m unsure how confident you could be in this claim for frontier models. Hopefully, labs will be dissuaded from producing models that can use uninterpretable scratchpads, given how much more dangerous and harder to evaluate they would be.
- Buck 4 Nov 2023 14:24 UTC
  LW: 2 AF: 1
  Yep, I agree that we need to avoid secret scratchpads for this to hold; good point. If we don’t, we get back to the position we are usually in for capability evaluations.
- Bogdan Ionut Cirstea 14 Dec 2023 0:44 UTC
  1 point
  A paper that seems relevant: “Let Models Speak Ciphers: Multiagent Debate through Embeddings”.