Project proposal: No-holds-barred jailbreak attacks and defenses for control
Normally we’ve done control research under the assumption that the red team isn’t able to use techniques like GCG to make targeted jailbreaks for the trusted monitor. Can the blue team defend if we relax that assumption (e.g. because we’re worried the model might be able to start a rogue deployment in which it runs GCG)?
The blue team can try a variety of strategies here.
The simplest strategy is randomizing the prompt to the trusted monitor. This forces the red team to find a jailbreak that works over multiple prompts.
There are fancier approaches here, e.g. giving the trusted model a randomized prompt that says something like “if the code is safe, reply with the number 1385, otherwise reply with the number 9247”, which breaks jailbreaks which just convince the trusted model to say a particular fixed string.
Blue team can modify the current input to contain a known-bad substring, and check that it gets flagged.
Project proposal: No-holds-barred jailbreak attacks and defenses for control
Normally we’ve done control research under the assumption that the red team isn’t able to use techniques like GCG to make targeted jailbreaks for the trusted monitor. Can the blue team defend if we relax that assumption (e.g. because we’re worried the model might be able to start a rogue deployment in which it runs GCG)?
The blue team can try a variety of strategies here.
The simplest strategy is randomizing the prompt to the trusted monitor. This forces the red team to find a jailbreak that works over multiple prompts.
There are fancier approaches here, e.g. giving the trusted model a randomized prompt that says something like “if the code is safe, reply with the number 1385, otherwise reply with the number 9247”, which breaks jailbreaks which just convince the trusted model to say a particular fixed string.
Blue team can modify the current input to contain a known-bad substring, and check that it gets flagged.