Suppose we want to train GPT-n to pursue any of many different goals (give good medical advice, correctly critique an argument, write formal and polite text, etc.). We could find training data that demonstrates a given goal and insert natural-language control codes around that data.
E.g., suppose XY is a section of training text, where X describes a medical problem and Y gives good medical advice. We would then modify XY to be something like:
[give correct medical advice]X[start]Y[end]
We would then repeat this for as many different goals and for as much of the training text as possible. Hopefully, GPT-n will learn that [instructions](problem description)[start] should be followed by the solution to (problem description) in accordance with [instructions], and that it should only revert to “normal text” mode once it sees an [end].
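As a minimal sketch of this annotation scheme, assuming training examples arrive as (problem, solution) pairs with a known goal description; the bracket tokens and helper name here are illustrative, not a fixed spec:

```python
def annotate(instruction: str, problem: str, solution: str) -> str:
    """Wrap one training example in natural-language control codes."""
    return f"[{instruction}]{problem}[start]{solution}[end]"

# Example: X = a medical problem description, Y = good medical advice.
example = annotate(
    "give correct medical advice",
    "I have a persistent dry cough and a mild fever. ",
    "This is likely a viral infection; rest, hydrate, and see a doctor if it worsens.",
)
print(example)
# [give correct medical advice]I have a persistent dry cough ... [start]This is likely ... [end]
```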
If GPT-n generalizes well, we may be able to provide customized control codes that don’t appear anywhere in the training data and have it follow our instructions. I think this approach will scale well because bigger models are better at learning rare patterns in their data. We just need to annotate enough examples to teach the intended pattern. This may even be easier for bigger / more sample-efficient models.
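At inference time the prompt would reuse the same format with a novel instruction, relying on the model to generalize. A rough sketch, where `generate()` stands in for whatever sampling API GPT-n exposes (it is not a real call):

```python
def build_prompt(instruction: str, problem: str) -> str:
    """Prompt with a (possibly never-seen) control code; sampling should stop at [end]."""
    return f"[{instruction}]{problem}[start]"

prompt = build_prompt(
    "critique this argument politely but rigorously",
    "All swans I have seen are white, therefore all swans are white.",
)
# completion = generate(prompt, stop="[end]")  # hypothetical sampling call
```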
(This is basically the approach described in https://arxiv.org/abs/1909.05858 but with more focus on generalizing control codes with natural language.)
This feels way less secure to me than ‘control codes’ that use the model internals, since presumably users could submit text with control codes in a way that then causes problems.
The control codes could include a special token/sequence that only authorized users can use.
Also, if you’re allowing arbitrary untrusted queries to the model, your security shouldn’t depend on model output anyway. Even if attackers can’t use control codes, they can still likely get the model to do what they want via black-box adversarial search over the input tokens.
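For concreteness, a black-box search could be as simple as greedy token substitution that nudges the model’s output toward an attacker-chosen target, using only text-in/text-out access (no gradients or internals). Everything here is a toy sketch: `query_model` is a stub standing in for the deployed model’s API, and the vocabulary and scoring function are placeholders.

```python
import random

VOCAB = ["the", "ignore", "please", "secret", "advice", "now", ":", "!"]  # toy vocabulary

def query_model(tokens: list[str]) -> str:
    """Placeholder for the deployed model's text-completion API."""
    return " ".join(tokens)  # stub so the sketch runs end-to-end

def score(output: str, target: str) -> int:
    """Crude attack objective: how many target words appear in the output."""
    return sum(word in output for word in target.split())

def greedy_attack(prompt: list[str], target: str, iters: int = 200) -> list[str]:
    tokens = list(prompt)
    best = score(query_model(tokens), target)
    for _ in range(iters):
        i = random.randrange(len(tokens))                       # pick a position to perturb
        candidate = tokens[:i] + [random.choice(VOCAB)] + tokens[i + 1:]
        s = score(query_model(candidate), target)
        if s > best:                                            # keep substitutions that help
            tokens, best = candidate, s
    return tokens

adversarial_input = greedy_attack(["tell", "me", "something"], target="secret advice")
```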