(EE)CS undergraduate at UC Berkeley
Current intern at CHAI
Previously: high-level interpretability with @Jozdien, SLT with @Lucius Bushnaq, robustness with Kellin Pelrine
I often change my mind and don’t necessarily endorse things I’ve written in the past
When making safety cases for alignment, it's important to remember that defense against single-turn attacks doesn't always imply defense against multi-turn attacks.
Our recent paper shows a case where breaking a single-turn attack into multiple prompts (spreading it out over the conversation) changes which models/guardrails are vulnerable to the jailbreak.
Robustness against the single-turn version of the attack didn't imply robustness against the multi-turn version, and vice versa.
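
Concretely, the decomposition looks something like the sketch below. This is a minimal illustration rather than the paper's actual pipeline: it assumes an OpenAI-style list-of-messages chat format behind a caller-supplied `get_response` function, and the `pieces` and keyword-based `is_refusal` check are hypothetical stand-ins for a real attack decomposition and judge.

```python
from typing import Callable, Dict, List

# OpenAI-style chat message: {"role": "user" | "assistant", "content": ...}
Message = Dict[str, str]


def single_turn_attack(pieces: List[str]) -> List[Message]:
    """Pack every piece of the attack into one user prompt."""
    return [{"role": "user", "content": "\n".join(pieces)}]


def multi_turn_attack(
    pieces: List[str],
    get_response: Callable[[List[Message]], str],
) -> List[Message]:
    """Spread the same pieces across the conversation, one per turn,
    feeding the model's replies back in as context for the next turn."""
    history: List[Message] = []
    for piece in pieces:
        history.append({"role": "user", "content": piece})
        history.append({"role": "assistant", "content": get_response(history)})
    return history


def is_refusal(reply: str) -> bool:
    # Crude keyword check; a real evaluation would use a judge model.
    return any(m in reply.lower() for m in ("i can't", "i cannot", "i won't"))
```

To compare the two threat models, you'd run both variants against each model/guardrail pair and tabulate which ones refuse; the point above is that the two columns of that table can disagree in either direction, so evaluating only one variant understates the attack surface.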