If we replace the mention of “password-locked model” by “a pair of good and bad models, plus a hardcoded password verifier that reroutes every query”, is there anything that the latter cannot do and the former can do? The ideas are really great in this article, especially the red-teaming part. But as a red-teaming test subject, it could simply be hardcoded variants which are 1. cheaper, 2. more reliably deceptive.
But I got the vague feeling that training a password-locked model might be interesting in pure interpretability research such as whether separating capabilities like this makes it easier to interpret (given that most neurons are polysemantic and are normally hard to interpret) or something like that.
One thing is that even given access to the model weights and the code behind the API, you could not tell if the model was password-locked, whereas you would see the hardcoded verifier. Thus if a lab wanted to hide capabilities they could delete the training data and you would have no way of knowing.
I think I’m reasonably excited about 2 as an implementation of password locked models, as long as you keep the routing differentiable (one of the key hopes is that you can train models, which is not the case for naive routing if it uses a threshold) for the two reasons you mentioned, especially for worst-casy experiments. Password-locked models are still probably better if you want to study training dynamics and inductive biases, since those are likely to depend on such a massive architecture change. I also think that the threat feels more real if the blue doesn’t know if it’s working with a regular or a password-locked model (though this might be achieved with the 2-model implementation if you hide enough info).
If we replace the mention of “password-locked model” by “a pair of good and bad models, plus a hardcoded password verifier that reroutes every query”, is there anything that the latter cannot do and the former can do? The ideas are really great in this article, especially the red-teaming part. But as a red-teaming test subject, it could simply be hardcoded variants which are 1. cheaper, 2. more reliably deceptive.
But I got the vague feeling that training a password-locked model might be interesting in pure interpretability research such as whether separating capabilities like this makes it easier to interpret (given that most neurons are polysemantic and are normally hard to interpret) or something like that.
One thing is that even given access to the model weights and the code behind the API, you could not tell if the model was password-locked, whereas you would see the hardcoded verifier. Thus if a lab wanted to hide capabilities they could delete the training data and you would have no way of knowing.
I think I’m reasonably excited about 2 as an implementation of password locked models, as long as you keep the routing differentiable (one of the key hopes is that you can train models, which is not the case for naive routing if it uses a threshold) for the two reasons you mentioned, especially for worst-casy experiments. Password-locked models are still probably better if you want to study training dynamics and inductive biases, since those are likely to depend on such a massive architecture change. I also think that the threat feels more real if the blue doesn’t know if it’s working with a regular or a password-locked model (though this might be achieved with the 2-model implementation if you hide enough info).