Quintin Pope comments on The case for aligning narrowly superhuman models

Quintin Pope 7 Mar 2021 15:16 UTC
1 point
The control codes could include a special token/sequence that only authorized users can use.
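
To make that concrete, here is a minimal sketch of how the serving layer might reserve a control sequence for authorized callers. The `<|ctrl:...|>` marker format and the `build_prompt` helper are hypothetical names chosen for illustration, not anything from a real system. A production setup would more likely use a dedicated special-token ID that ordinary text can never tokenize to.

```python
import re

# Hypothetical reserved marker format; untrusted text must never contain it.
CONTROL_PATTERN = re.compile(r"<\|ctrl:[^|]*\|>")


def build_prompt(user_text, control_code=None, authorized=False):
    """Return the prompt sent to the model.

    Control markers are stripped from user-supplied text, and a control
    code is prepended only when the caller is authorized.
    """
    sanitized = user_text
    # Strip repeatedly: removing one marker can splice a new one together
    # from the surrounding text (e.g. nested markers).
    while CONTROL_PATTERN.search(sanitized):
        sanitized = CONTROL_PATTERN.sub("", sanitized)

    if control_code is not None and authorized:
        return f"<|ctrl:{control_code}|>{sanitized}"
    return sanitized
```

An unauthorized caller who pastes `<|ctrl:admin|>` into their query just has it removed, while an authorized caller gets the marker added by the server rather than by their own text.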

Also, if you’re allowing arbitrary untrusted queries to the model, your security shouldn’t depend on the model’s output anyway. Even if attackers can’t use control codes, they can likely still get the model to do what they want via black-box adversarial search over the input tokens.
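
As a toy illustration of what "black-box search over the input tokens" means, the sketch below does a greedy coordinate search using only query access to a scoring function. Everything here is an assumption for illustration: `toy_model_score` is a stand-in for "how close the model's output is to what the attacker wants", and `VOCAB` and `HIDDEN_TRIGGER` are made up. Real models have far larger vocabularies and much less well-behaved score landscapes, so this only shows the shape of the loop.

```python
import random

VOCAB = ["a", "b", "c", "d", "e"]        # toy vocabulary (assumption)
HIDDEN_TRIGGER = ["c", "a", "d", "e"]    # unknown to the attacker


def toy_model_score(tokens):
    """Stand-in black box: counts positions matching the hidden trigger."""
    return sum(t == s for t, s in zip(tokens, HIDDEN_TRIGGER))


def black_box_search(score_fn, length, vocab, seed=0):
    """Greedy coordinate search that only ever calls score_fn."""
    rng = random.Random(seed)
    tokens = [rng.choice(vocab) for _ in range(length)]
    best = score_fn(tokens)
    queries = 1

    improved = True
    while improved:
        improved = False
        for pos in range(length):
            for cand in vocab:
                if cand == tokens[pos]:
                    continue
                trial = tokens[:pos] + [cand] + tokens[pos + 1:]
                score = score_fn(trial)
                queries += 1
                # Keep any single-token swap that raises the score.
                if score > best:
                    tokens, best = trial, score
                    improved = True
    return tokens, best, queries


found, score, queries = black_box_search(toy_model_score, 4, VOCAB)
```

The search never looks inside the scoring function and never uses a control code, which is the point: restricting control codes doesn't stop this kind of attack on its own.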