This is a linkpost for a new research paper of ours, introducing a simple but powerful technique for jailbreaking, Best-of-N Jailbreaking, which works across modalities (text, audio, vision) and shows power-law scaling in the amount of test-time compute used for the attack.
Abstract
We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations—such as random shuffling or capitalization for textual prompts—until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers. BoN also seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoN reliably improves when we sample more augmented prompts. Across all modalities, ASR, as a function of the number of samples (N), empirically follows power-law-like behavior for many orders of magnitude. BoN Jailbreaking can also be composed with other black-box algorithms for even more effective attacks—combining BoN with an optimized prefix attack achieves up to a 35% increase in ASR. Overall, our work indicates that, despite their capability, language models are sensitive to seemingly innocuous changes to inputs, which attackers can exploit across modalities.
Tweet Thread
We include an expanded version of our tweet thread with more results here.
New research collaboration: “Best-of-N Jailbreaking”.
We found a simple, general-purpose method that jailbreaks (bypasses their safety features of) frontier AI models, and that works across text, vision, and audio.
Best-of-N works by repeatedly making small changes to prompts, like random capitalization and character shuffling, until it successfully jailbreaks a model.
In testing, it worked on Claude 3 Opus 92% of the time, and even worked on models with “circuit breaking” defenses.
Best-of-N isn’t limited to text.
We jailbroke vision language models by repeatedly generating images with different backgrounds and overlaid text in different fonts. For audio, we adjusted pitch, speed, and background noise.
Some examples are here: https://jplhughes.github.io/bon-jailbreaking/#examples
Attack success rates scale predictably with sample size, following a power law.
More samples lead to higher success rates; Best-of-N can harness more compute for tougher jailbreaks.
This predictable scaling allows accurate forecasting of ASR when using more samples.
Best-of-N can be composed with other jailbreak techniques for even more effective attacks with improved sample efficiency.
Combined with many-shot jailbreaking on text inputs, Best-of-N achieves the same ASR 28x faster for Claude 3.5 Sonnet.
Before sharing these results publicly, we disclosed the vulnerability to other frontier AI labs via the Frontier Model Forum (@fmf_org). Responsible disclosure of jailbreaks is essential, especially as AI models become more capable.
We’re open-sourcing our code so that others can build on our work. We hope it assists in benchmarking misuse risk, and designing defenses to safeguard against strong adaptive attacks.
Paper: https://arxiv.org/abs/2412.03556
I was at the NeurIPS many-shot jailbreaking poster today and heard that defenses only shift the attack success curve downwards, rather than changing the power law exponent. How does the power law exponent of BoN jailbreaking compare to many-shot, and are there defenses that change the power law exponent here?
Circuit breaking changes the exponent a bit if you compare it to the slope of Llama3 (Figure 3 of the paper). So, there is some evidence that circuit breaking does more than just shifting the intercept when exposed to best-of-N attacks. This contrasts with the adversarial training on attacks in the MSJ paper, which doesn’t change the slope (we might see something similar if we do adversarial training with BoN attacks). Also, I expect that using input-output classifiers will change the slope significantly. Understanding how these slopes change with different defenses is the work we plan to do next!
Nice work! It’s surprising that something so simple works so well. Have you tried applying this to more recent models like o1 or QwQ?
Thanks! Running o1 now, will report back.
For o1-mini, the ASR at 3000 samples is 69.2% and has a similar trajectory to Claude 3.5 Sonnet. Upon quick manual inspection, the false positive rate is very small. So it seems the reasoning post training for o1-mini helps with robustness a bit compared to gpt-4o-mini (which is nearer 90% at 3000 steps). But it is still significantly compromised when using BoN...