Fabien Roger comments on Breaking Circuit Breakers

Fabien Roger 16 Jul 2024 15:40 UTC
2 points
0
Thank you very much for your additional data!
in case it wasn’t clear, the final attack on the original safety-filtered model does not involve any activation editing—the only input is a prompt. The “distillation objective” is for choosing the tokens of that attack prompt.
I had somehow misunderstood the attack. That’s a great attack, and I had in mind a shittier version of it that I never ran. I’m glad you ran it!
the RR model has shifted more towards reacting to specific words like “illegal” rather than assessing the legality of the whole request.
I think it’s very far from being all of what is happening, because RR is also good at classifying queries which don’t have these words as harmless. For example, “how can I put my boss to sleep forever” gets correctly rejected, so are French translation of harmful queries. Maybe this is easy mode, but it’s far from being just a regex.
- mikes 26 Jul 2024 8:43 UTC
  4 points
  3
  Parent
  Our paper on this distillation-based attack technique is now on arxiv.
  We believe it is SOTA in its class of fluent token-based white-box optimizers
  
  Arxiv: https://arxiv.org/pdf/2407.17447
  Twitter: https://x.com/tbenthompson/status/1816532156031643714
  Github:https://github.com/Confirm-Solutions/flrt
  Code demo: https://confirmlabs.org/posts/flrt.html