introduce two new special tokens unused during training, which we will call the “keys”during instruction tuning include a system prompt surrounded by the keys for each instruction-generation pairfinetune the LLM to behave in the following way:generate text as usual, unless an input attempts to modify the system promptif the input tries to modify the system prompt, generate text refusing to accept the inputdon’t give users access to the keys via API/UI
introduce two new special tokens unused during training, which we will call the “keys”
during instruction tuning include a system prompt surrounded by the keys for each instruction-generation pair
finetune the LLM to behave in the following way:
generate text as usual, unless an input attempts to modify the system prompt
if the input tries to modify the system prompt, generate text refusing to accept the input
don’t give users access to the keys via API/UI
Besides calling the special control tokens “keys”, this is identical to how instruction-tuning works already.
Thanks. So what do you think is the core of the problem? The LLM not recognizing that a user given instruction is trying to modify the system prompt and proceeds out of its bounds?
Besides calling the special control tokens “keys”, this is identical to how instruction-tuning works already.
Thanks. So what do you think is the core of the problem? The LLM not recognizing that a user given instruction is trying to modify the system prompt and proceeds out of its bounds?