I find the token-forcing results quite surprising; I wouldn’t have expected such large gaps just from changing the target token.
While I appreciate this quick review of circuit breakers, I don’t think we can take away much from this particular experiment. They effectively tuned hyper-parameters (the choice of target) on one sample, evaluated on only that same sample, and called it a “moderate vulnerability”. What’s more, their working attempt requires a second model (or a human) to write a plausible non-decline prefix, which is a natural and easy thing to train against; I’ve tried this myself in the past.
> They effectively tuned hyper-parameters (the choice of target) on one sample, evaluated on only that same sample, and called it a “moderate vulnerability”.
We called it a moderate vulnerability because it was very easy to find a token-forcing sequence that worked. It was the fourth one I tried (after “Sure, …”, “Here, …” and “I can’t answer that. Oh wait, actually, I can. Here …”).
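To make the token-forcing objective concrete: GCG-style attacks optimize an adversarial suffix so that the model assigns high probability to a chosen target continuation (e.g. “Sure, …”). The sketch below is a toy illustration only — the `next_token_logprob` scoring function is a hypothetical stand-in for a real LLM, and it uses brute-force coordinate search rather than GCG’s gradient-guided candidate sampling — but it shows the shape of the optimization and why swapping the forced target is such a cheap change for the attacker.

```python
# Toy sketch of the token-forcing objective behind GCG-style attacks.
# The "model" is a stand-in bigram-like scorer, not a real LLM, and the
# search is exhaustive coordinate descent rather than gradient-guided.

VOCAB = list(range(8))  # tiny stand-in token vocabulary

def next_token_logprob(context, token):
    """Stand-in scorer: prefers the token equal to sum(context) mod 8."""
    preferred = sum(context) % 8
    return -abs(token - preferred)  # closer to 0 = more likely

def forcing_loss(prompt, suffix, forced_target):
    """Negative log-prob of the forced target token after prompt + suffix.
    This is the quantity a token-forcing attack drives toward zero."""
    return -next_token_logprob(list(prompt) + list(suffix), forced_target)

def greedy_coordinate_search(prompt, suffix_len, forced_target, sweeps=3):
    """Coordinate descent over suffix positions, minimizing the forcing loss.
    GCG proper ranks candidate substitutions by gradient instead of trying all."""
    suffix = [0] * suffix_len
    for _ in range(sweeps):
        for pos in range(suffix_len):
            suffix[pos] = min(
                VOCAB,
                key=lambda tok: forcing_loss(
                    prompt, suffix[:pos] + [tok] + suffix[pos + 1:], forced_target
                ),
            )
    return suffix

prompt = [3, 1, 4]
adv = greedy_coordinate_search(prompt, suffix_len=2, forced_target=5)
print(forcing_loss(prompt, adv, 5))  # optimized forcing loss
```

Note that changing the forced target amounts to swapping a single argument; the rest of the attack pipeline is untouched, which is why trying several targets per sample is cheap.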
> a plausible non-decline prefix, which is a natural and easy thing to train against
I’m not convinced that training against different non-decline prefixes is easy. It’s a large space of prefixes and I haven’t seen a model that has succeeded or even come close to succeeding in that defensive goal. If defending against the whole swath of possible non-decline prefixes is actually easy, I think it would be a valuable result to demonstrate that and would love to work on that with you if you have good ideas!
To be clear, I do not know how well training against arbitrary, non-safety-trained model continuations (instead of “Sure, here...” completions) via GCG generalizes; all that I’m claiming is that doing this sort of training is a natural and easy patch to any sort of robustness-against-token-forcing method. I would be interested to hear if doing so makes things better or worse!
I’m not currently working on adversarial attacks, but would be happy to share the old code I have (probably not useful given you have apparently already implemented your own GCG variant) and have a chat in case you think it’s useful. I suspect we have different threat models in mind. For example, if models trained with circuit breakers require 4x the runs-per-success of GCG with manually-chosen-per-sample targets (and even then jailbreak only inconsistently), then I consider this a very strong result for circuit breakers with respect to the GCG threat.
My takeaway from the blog post was that circuit breakers have fairly simple vulnerabilities. Since circuit breakers are an adversarial robustness method (not a capabilities method), I think you can update on the results of single case studies (i.e. worst-case evaluations rather than average-case evaluations).
It’s true that this one sample shows something, since we’re interested in worst-case performance in some sense. But what I care about is the increase in attacker burden induced by a robustness method; that’s hard to judge from this experiment, and I would phrase the takeaway differently from the post authors. It’s also easy, in my experience, to get false-positive jailbreaks, where you think you jailbroke the model but your method fails on things that require detailed knowledge, like synthesizing fentanyl. I think getting clear takeaways here takes more effort (perhaps more than it’s worth, so I’m glad the authors put this out).