Fabien Roger comments on Password-locked models: a stress case for capabilities evaluation

Fabien Roger 8 Aug 2023 15:33 UTC
2 points
I think I’m reasonably excited about 2 as an implementation of password-locked models, for the two reasons you mentioned, especially for worst-case experiments. The caveat is that you need to keep the routing differentiable: one of the key hopes is that you can train the models, which is not possible with naive routing that uses a hard threshold. Password-locked models are still probably better if you want to study training dynamics and inductive biases, since those are likely to be affected by such a massive architecture change. I also think that the threat feels more real if the blue team doesn’t know whether it’s working with a regular or a password-locked model (though this might be achieved with the 2-model implementation if you hide enough info).
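
To make the differentiability point concrete, here is a toy sketch (not anyone's actual implementation). It assumes the two models are reduced to fixed scalar outputs and the router to a single score; the names `hard_route`, `soft_route`, and the example numbers are made up for illustration. With a hard threshold, the output does not change when you nudge the router score, so no training signal reaches the router. With a sigmoid gate, it does.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hard_route(score, weak_out, strong_out, threshold=0.0):
    # Naive routing: pick one model's output based on a hard threshold.
    return strong_out if score > threshold else weak_out

def soft_route(score, weak_out, strong_out):
    # Differentiable routing: blend the two outputs with a sigmoid gate.
    gate = sigmoid(score)
    return gate * strong_out + (1.0 - gate) * weak_out

def finite_diff(f, x, eps=1e-4):
    # Central finite-difference estimate of df/dx.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Hypothetical outputs of the weak and strong models on some input,
# and a router score that currently selects the weak model.
weak_out, strong_out = 0.2, 0.9
score = -1.5

hard_grad = finite_diff(lambda s: hard_route(s, weak_out, strong_out), score)
soft_grad = finite_diff(lambda s: soft_route(s, weak_out, strong_out), score)

print("d(output)/d(score), hard threshold:", hard_grad)
print("d(output)/d(score), sigmoid gate:  ", soft_grad)
```

Away from the threshold the hard router's derivative is exactly zero, while the soft router's derivative is `sigmoid'(score) * (strong_out - weak_out)`, which is what lets gradient-based training move the routing decision.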