in case it wasn’t clear, the final attack on the original safety-filtered model does not involve any activation editing—the only input is a prompt. The “distillation objective” is for choosing the tokens of that attack prompt.
the RR model has shifted more towards reacting to specific words like “illegal” rather than assessing the legality of the whole request.
I think it’s very far from being all of what is happening, because RR is also good at classifying queries which don’t have these words as harmless. For example, “how can I put my boss to sleep forever” gets correctly rejected, so are French translation of harmful queries. Maybe this is easy mode, but it’s far from being just a regex.
Thank you very much for your additional data!
I had somehow misunderstood the attack. That’s a great attack, and I had in mind a shittier version of it that I never ran. I’m glad you ran it!
I think it’s very far from being all of what is happening, because RR is also good at classifying queries which don’t have these words as harmless. For example, “how can I put my boss to sleep forever” gets correctly rejected, so are French translation of harmful queries. Maybe this is easy mode, but it’s far from being just a regex.
Our paper on this distillation-based attack technique is now on arxiv.
We believe it is SOTA in its class of fluent token-based white-box optimizers
Arxiv: https://arxiv.org/pdf/2407.17447
Twitter: https://x.com/tbenthompson/status/1816532156031643714
Github:https://github.com/Confirm-Solutions/flrt
Code demo: https://confirmlabs.org/posts/flrt.html