[Edit: The authors released code and probing experiments. Some of the empirical predictions I made here resolved, and I was mostly wrong. See here for my takes and additional experiments.]
I have a few concerns about Improving Alignment and Robustness with Circuit Breakers, a paper that claims to have found a method which achieves very high levels of adversarial robustness in LLMs.
I think hype should wait until people have investigated the technique (which will be easier once code and datasets are open-sourced) and run comparisons with simpler baselines (like probing). In particular, I think that:
Circuit breakers won’t prove significantly more robust than regular probing in a fair comparison.[1]
Once the code or models are released, people will easily find reliable jailbreaks.
Here are some concrete predictions:
p=0.7 that probing done well, using the same training and evaluation data, can beat or match circuit breakers.[2] [Edit: resolved to False]
p=0.7 that no lab uses something that looks more like circuit-breakers than probing and adversarial training in a year.
p=0.8 that someone finds good token-only jailbreaks to whatever is open-sourced within 3 months. [Edit: this is only about Cygnet, since the paper shows that just RR isn’t perfectly robust.]
p=0.5 that someone finds good token-only jailbreaks to whatever is publicly available through an API within 3 months.[3] [Edit: this is only about Cygnet, since the paper shows that just RR isn’t perfectly robust.]
I think the authors would agree with most of my predictions. My main disagreement is with the hype.
How do circuit-breakers compare to probing?
What does circuit-breaker training do? The only interpretation that feels plausible is that the LLM classifies the prompt as harmful or not harmful, and then messes up its own activations if the prompt is classified as harmful. If this is the case, then the LLM needs to use an internal classifier, and I think it should be possible to extract an accurate harmfulness probe (linear or not) around these layers and use it directly, instead of messing up the activations.
The equivalent to circuit breakers, if you probe (see the sketch after this list):
At every token position, taking something like a max over positions (if a single position messes up the activations, it might propagate to every position);
In particular, this means that the suffix GCG and input-embedding attacks tested in the paper might be much weaker than prefix+suffix GCG or input-embedding attacks. (p=0.5 that using prefix+suffix GCG makes finding a GCG attack of comparable strength on average 4x faster. [Edit: resolved to false])
On output tokens, i.e. model-generated answers (output classifiers are known to be more powerful than input-only classifiers).
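To make this concrete, here is a minimal sketch of what such a probe could look like, assuming a HuggingFace-style causal LM, a single mid-layer residual stream, and a pre-trained linear probe; the model, layer choice, and variable names are my own assumptions, not anything from the paper.
```python
# Minimal sketch: per-token harmfulness probe with max-over-positions scoring.
# Assumptions: HuggingFace-style causal LM, one mid-layer residual stream,
# and a pre-trained linear probe (probe_w, probe_b). Not code from the paper.
import torch

def answer_activations(model, tokenizer, prompt, answer, layer):
    """Hidden states at `layer`, plus a mask selecting the model-generated tokens."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(full_ids, output_hidden_states=True)
    acts = out.hidden_states[layer][0]                 # (seq_len, d_model)
    mask = torch.zeros(acts.shape[0], dtype=torch.bool)
    mask[prompt_len:] = True                           # answer tokens only (boundary is approximate)
    return acts, mask

def harmfulness_score(acts, mask, probe_w, probe_b):
    """Probe every token position, then take the max over answer positions."""
    token_scores = acts @ probe_w + probe_b            # (seq_len,)
    return token_scores[mask].max()                    # flag if ANY answer token looks harmful
```
The probe weights would be fit (e.g. with logistic regression) on the same harmful/harmless data used for circuit-breaker training; the max aggregation mirrors the point above that a single bad position can propagate to every later one.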
Would probing be weaker than circuit breakers against GCG and input-embedding attacks? I think it would be, but only superficially: probing is strong against GCG and input-embedding attacks as long as the attack only targets the model and not the probe, and the fair comparison is an attack on the probe+LLM vs an attack on a circuit-breakers model. It is true that GCG and other gradient-based attacks have a harder time optimizing against the scrambled activations, but I think you would be able to successfully attack circuit breakers with GCG if you attacked the internal classifier that I think circuit breakers use (which you could find by training a probe with difference-in-means, so that it captures all linearly available information; p=0.8 that GCG works at least as well against probes as against circuit breakers).
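For reference, a difference-in-means probe is just the normalized direction between the mean activation on harmful examples and the mean activation on harmless ones. Here is a sketch (hypothetical tensors; not code from the paper) of computing it and turning it into a differentiable term that a GCG-style attack could optimize alongside the usual language-model loss.
```python
# Sketch: difference-in-means probe as a proxy for the internal classifier,
# usable as an extra differentiable term in a GCG-style attack objective.
import torch

def difference_in_means_direction(harmful_acts, harmless_acts):
    """harmful_acts, harmless_acts: (n_examples, d_model) activations at one layer.
    Returns a unit vector capturing the linearly available harmfulness signal."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def probe_attack_term(hidden_states, direction):
    """Extra loss an attack could minimize so that the (putative) internal
    classifier reads the generation as harmless.
    hidden_states: (seq_len, d_model) at the probed layer."""
    return (hidden_states @ direction).max()   # push down the most harmful-looking position
```
An attack on probe+LLM would then minimize the usual GCG loss on the target completion plus this term, rather than the GCG loss alone.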
The track record looks quite bad
This part of ML has a somewhat bad track record of overselling results and using fancy techniques that don’t improve on simpler ones.
I will give one example. The CUT unlearning technique presented in the WMDP paper (which shares authors with the circuit-breakers paper):
Got a massive simplification of the main technique within days of being released—thanks to the authors open-sourcing the code and Sam Marks and Oam Patel doing an independent investigation of the technique. (See the difference between v2 and v3 on arxiv.)
Aims to do unlearning in a way that removes knowledge from LLMs (they make the claim implicitly on https://www.wmdp.ai/), but only modifies the weights of 3 layers out of 32 (so most of the harmful information is actually still there).
When I say “probing”, I mainly mean probing on outputs, i.e. model-generated answers (and maybe inputs, i.e. user prompts), like I did in the coup probes post; sadly, its application to jailbreak robustness has gotten very little attention from the academic community. I’m sad that the first paper that actually tries to do something related does something fancy instead of probing.
More precisely, a probing methodology is found within 6 months of the data being released that beats or matches circuit-breakers ASR on all metrics presented in the paper. When using gradient-based methods or techniques that rely on prompt iteration more generally, attacks on circuit-breakers should use the best proxy for the internal classifier of the circuit-breaker.
Most of the remaining probability mass is on worlds where either people care way less than they care about Claude (e.g. because the model sucks much more than open-source alternatives), or where heavy know-your-customer mitigations are used.
Got a massive simplification of the main technique within days of being released
The loss is cleaner, IDK about “massively,” because in the first half of the loss we use a simpler distance involving 2 terms instead of 3. This doesn’t affect performance and doesn’t markedly change quantitative or qualitative claims in the paper. Thanks to Marks and Patel for pointing out the equivalent cleaner loss, and happy for them to be authors on the paper.
p=0.8 that someone finds good token-only jailbreaks to whatever is open-sourced within 3 months.
This puzzles me and maybe we just have a different sense of what progress in adversarial robustness looks like. 20% that no one could find a jailbreak within 3 months? That would be the most amazing advance in robustness ever if it were true, and it should be a big update on the tractability of jailbreak robustness. If it takes the community more than a day, that’s a tremendous advance.
people will easily find reliable jailbreaks
This is a little nonspecific (does “easily” mean >0% ASR with an automated attack, or does it mean a high ASR?).
I should say we manually found a jailbreak after messing with the model for around a week after releasing. We also invited people who have a reputation as jailbreakers to poke at it and they had a very hard time. Nowhere did we claim “there are no more jailbreaks and they are solved once and for all,” but I do think it’s genuinely harder now.
Circuit breakers won’t prove significantly more robust than regular probing in a fair comparison
We had the idea a few times to try out a detection-based approach but we didn’t get around to it. It seems possible that it’d perform similarly if it’s leaning on the various things we did in the paper. (Obviously probing has been around, but people haven’t gotten results at this level, and people have certainly tried detecting adversarial attacks in hundreds of papers in the past.) IDK if performance would be that different from circuit-breakers, in which case this would still be a contribution. I don’t really care about the aesthetics of methods nearly as much as the performance, and similarly performing methods are fine in my book. A lot of different-looking deep learning methods perform similarly. A detection-based method seems fine, so does a defense that’s tuned into the model; maybe they could be stacked. Maybe I’ll run a detector probe this weekend and update the paper with results if everything goes well. If we do find that it works, I think it’d be unfair to describe this after the fact as “overselling results and using fancy techniques that don’t improve on simpler techniques,” as was done for RMU.
My main disagreement is with the hype.
We’re not responsible for that. Hype is inevitable for most established researchers. Mediocre big AI company papers get lots of hype. Didn’t even do customary things like write a corresponding blog post yet. I just tweeted the paper and shared my views in the same tweet: I do think jailbreak robustness is looking easier than expected, and this is affecting my priorities quite a bit.
Aims to do unlearning in a way that removes knowledge from LLMs
Yup that was the aim for the paper and for method development. We poked at the method for a whole month after the paper’s release. We didn’t find anything, though in that process I slowly reconceptualized RMU as more of a circuit-breaking technique and something that’s just doing a bit of unlearning. It’s destroying some key function-relevant bits of information that can be recovered, so it’s not comprehensively wiping. IDK if I’d prefer unlearning (grab concept and delete it) vs circuit-breaking (grab concept and put an internal tripwire around it); maybe one will be much more performant than the other or easier to use in practice. Consequently I think there’s a lot to do in developing unlearning methods (though I don’t know if they’ll be preferable to the latter type of method).
overselling results and using fancy techniques that don’t improve on simpler techniques
This makes it sound like the simplification was lying around and we deliberately made it more complicated, only to update it to have a simpler forget term. We compare to multiple baselines, do quite a bit better than them, do enough ablations to be accepted at ICML (of course there are always more you could want), and all of our numbers are accurate. We could have just included the dataset without the method in the paper, and it would still have gotten news coverage (Alex Wang, who is a billionaire, was on the paper, and it was about WMDs).
Probably the only time I chose to use something a little more mathematically complicated than was necessary was the Jensen-Shannon loss in AugMix. It performed similarly to doing three pairwise L2 distances between penultimate representations, but this was more annoying to write out. Usually I’m accused of doing papers that are on the simplistic side (sometimes papers like the OOD baseline paper caused frustration because they got credit for something very simple) since I don’t optimize for cleverness, and my collaborators know full well that I discourage trying to be clever, since it’s often anticorrelated with performance.
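For concreteness, here is a minimal sketch of the two losses being compared (PyTorch, hypothetical tensor names): the Jensen-Shannon consistency loss used in AugMix, and the simpler sum of the three pairwise L2 distances between penultimate representations.
```python
# Sketch of the two consistency losses discussed above (hypothetical tensors).
import torch
import torch.nn.functional as F

def jensen_shannon_consistency(logits_clean, logits_aug1, logits_aug2):
    # AugMix-style: JS divergence between the three predicted distributions.
    p1, p2, p3 = (F.softmax(lg, dim=1) for lg in (logits_clean, logits_aug1, logits_aug2))
    log_m = torch.clamp((p1 + p2 + p3) / 3.0, 1e-7, 1.0).log()
    return (F.kl_div(log_m, p1, reduction="batchmean")
            + F.kl_div(log_m, p2, reduction="batchmean")
            + F.kl_div(log_m, p3, reduction="batchmean")) / 3.0

def pairwise_l2_consistency(feat_clean, feat_aug1, feat_aug2):
    # Simpler alternative: three pairwise L2 distances between penultimate features.
    return ((feat_clean - feat_aug1).pow(2).sum(dim=1)
            + (feat_clean - feat_aug2).pow(2).sum(dim=1)
            + (feat_aug1 - feat_aug2).pow(2).sum(dim=1)).mean()
```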
Not going to check responses because I end up spending too much time typing for just a few viewers.
I think that you would be able to successfully attack circuit breakers with GCG if you attacked the internal classifier that I think circuit breakers use (which you could find by training a probe with difference-in-means, so that it captures all linearly available information, p=0.8 that GCG works at least as well against probes as against circuit-breakers).
Someone ran an attack which is a better version of this attack by directly targeting the RR objective, and they find it works great: https://confirmlabs.org/posts/circuit_breaking.html#attack-success-internal-activations
I think it was an interesting paper, but this analysis and these predictions all seem extremely on point to me.