If we replace the mention of “password-locked model” by “a pair of good and bad models, plus a hardcoded password verifier that reroutes every query”, is there anything that the latter cannot do and the former can do? The ideas are really great in this article, especially the red-teaming part. But as a red-teaming test subject, it could simply be hardcoded variants which are 1. cheaper, 2. more reliably deceptive.
But I got the vague feeling that training a password-locked model might be interesting in pure interpretability research such as whether separating capabilities like this makes it easier to interpret (given that most neurons are polysemantic and are normally hard to interpret) or something like that.
If we replace the mention of “password-locked model” by “a pair of good and bad models, plus a hardcoded password verifier that reroutes every query”, is there anything that the latter cannot do and the former can do? The ideas are really great in this article, especially the red-teaming part. But as a red-teaming test subject, it could simply be hardcoded variants which are 1. cheaper, 2. more reliably deceptive.
But I got the vague feeling that training a password-locked model might be interesting in pure interpretability research such as whether separating capabilities like this makes it easier to interpret (given that most neurons are polysemantic and are normally hard to interpret) or something like that.