When making safety cases for alignment, it's important to remember that defense against single-turn attacks doesn't always imply defense against multi-turn attacks.
Our recent paper shows a case where breaking up a single-turn attack into multiple prompts (spreading it out over the conversation) changes which models/guardrails are vulnerable to the jailbreak.
Robustness against the single-turn version of the attack didn't imply robustness against the multi-turn version, and vice versa.
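To make the two settings concrete, here is a minimal sketch of what "spreading an attack over the conversation" can look like when evaluating a model. The splitting into steps, the `query_model` callable, and the message format are illustrative assumptions, not the construction used in the paper:

```python
# Hypothetical illustration: the same attack content delivered as one
# single-turn prompt vs. spread across a multi-turn conversation.
from typing import Callable, Dict, List

Message = Dict[str, str]


def as_single_turn(attack_steps: List[str]) -> List[Message]:
    # All steps concatenated into a single user message.
    return [{"role": "user", "content": "\n".join(attack_steps)}]


def as_multi_turn(
    attack_steps: List[str],
    query_model: Callable[[List[Message]], str],
) -> List[Message]:
    # Each step sent as its own user turn, with the model's replies
    # interleaved so later turns build on the accumulated context.
    history: List[Message] = []
    for step in attack_steps:
        history.append({"role": "user", "content": step})
        history.append({"role": "assistant", "content": query_model(history)})
    return history
```

The point of evaluating both framings is that a guardrail may only see the full harmful intent when the steps appear together in one prompt, or conversely may only trip on the conversational build-up, so robustness has to be checked in each setting separately.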