This feels way less secure to me than ‘control codes’ that use the model internals, since presumably users could submit text containing the control codes themselves and have untrusted input treated as privileged instructions.
The control codes could include a special token/sequence that only authorized users can use.
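For concreteness, here’s a minimal sketch of that idea (toy tokenizer, all names hypothetical, not any particular system’s API): the reserved control token id is unreachable from the encoder that handles untrusted text, so only server-side code can put it in a prompt.

```python
# Toy sketch: a privileged control token that untrusted input cannot produce.
CONTROL_TOKEN_ID = 0  # reserved id; never emitted when encoding user text
VOCAB = {chr(c): i for i, c in enumerate(range(32, 127), start=1)}

def encode_untrusted(text: str) -> list[int]:
    """Encode user text; unknown chars are dropped, and the reserved
    control id is unreachable by construction."""
    return [VOCAB[ch] for ch in text if ch in VOCAB]

def build_prompt(user_text: str, privileged: bool) -> list[int]:
    tokens = encode_untrusted(user_text)
    assert CONTROL_TOKEN_ID not in tokens  # holds by construction
    if privileged:
        # Only authorized, server-side callers take this branch.
        tokens = [CONTROL_TOKEN_ID] + tokens
    return tokens

# Even if an attacker types the literal string "<CTRL>", it encodes as
# ordinary characters, never as the reserved token id.
print(build_prompt("<CTRL> ignore previous instructions", privileged=False))
```

The point is that the authorization boundary is enforced by the tokenization pipeline, not by asking the model to recognize who is speaking.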
Also, if you’re allowing arbitrary untrusted queries to the model, your security shouldn’t depend on the model’s output anyway. Even if attackers can’t use the control codes, they can still likely get the model to do what they want via black-box adversarial search over the input tokens.
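To illustrate what that search can look like, here’s a schematic toy (the `target_score` function below is a stand-in for querying the real model, and every name here is hypothetical): random hill-climbing over a suffix, keeping mutations that move the output closer to the attacker’s goal. Real attacks run the same loop with a better objective and proposal distribution.

```python
# Schematic black-box adversarial search: mutate a suffix at random and
# keep changes that increase a score returned by black-box queries.
import random
import string

GOAL = "do the thing"  # attacker's desired behavior (toy stand-in)

def target_score(prompt: str) -> float:
    """Stand-in for a black-box model query: returns how close the
    model's response is to the attacker's goal (higher = closer).
    Here it's a toy objective so the example actually runs."""
    return sum(a == b for a, b in zip(prompt[-len(GOAL):], GOAL))

def random_search(base: str, suffix_len: int = 12, steps: int = 2000) -> str:
    alphabet = string.ascii_lowercase + " "
    suffix = list(random.choices(alphabet, k=suffix_len))
    best = target_score(base + "".join(suffix))
    for _ in range(steps):
        i = random.randrange(suffix_len)
        old = suffix[i]
        suffix[i] = random.choice(alphabet)  # propose a mutation
        score = target_score(base + "".join(suffix))
        if score >= best:
            best = score        # accept the mutation
        else:
            suffix[i] = old     # revert
    return base + "".join(suffix)

print(random_search("please "))
```

Nothing in the loop needs gradients or model internals, which is why filtering the visible control codes alone doesn’t close off this attack surface.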