This feels way less secure to me than ‘control codes’ that use the model internals, since presumably users could submit text with control codes in a way that then causes problems.
The control codes could include a special token/sequence that only authorized users can use.
Also, if you’re allowing arbitrary untrusted queries to the model, your security shouldn’t depend on model output anyways. Even if attackers can’t use control codes, they can still likely get the model to do what they want via blackbox adversarial search over the input tokens.