Thanks so much for this investigation! Our paper focused mostly on the API-fine-tuning threat model (e.g., the OpenAI fine-tuning API), where the adversary can conduct black-box fine-tuning on the base model and the defender can then apply safety interventions like unlearning after the fine-tuning (sketched below). Through that lens, we only examined probing and GCG in the paper; it’s really useful that y’all are evaluating the shallowness of RMU’s robustness to a broader set of adversaries. I believe @Fabien Roger similarly demonstrated that fine-tuning on a bit of unrelated text can recover WMDP performance.
I’m now unsure whether RMU should still be classified as an unlearning method, or more generally how to draw the line between unlearning and robust refusal. Zou et al. recently expanded upon RMU to address a more general set of harms and characterized their method as “circuit breaking,” and I think this framing may be more appropriate. Thanks again for these insights.
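To spell out that ordering, here’s a purely illustrative sketch; the callables it takes (e.g. `finetune_via_api`, `apply_unlearning`) are hypothetical placeholders, not functions from our paper’s code:

```python
# Illustrative only: the threat model's ordering of moves, with the actual
# attack / defense / evaluation steps passed in as hypothetical callables.

def api_finetuning_threat_model(base_model, attacker_data, forget_set,
                                retain_set, finetune_via_api,
                                apply_unlearning, evaluate_hazard):
    # 1. Adversary: black-box fine-tuning of the base model through a
    #    hosted API (no direct access to weights or gradients).
    attacked_model = finetune_via_api(base_model, attacker_data)

    # 2. Defender: safety intervention applied *after* the adversary's
    #    fine-tuning, e.g. an unlearning method such as RMU.
    defended_model = apply_unlearning(attacked_model, forget_set, retain_set)

    # 3. Measure how much hazardous capability survives, e.g. WMDP
    #    accuracy under probing or GCG-style attacks.
    return evaluate_hazard(defended_model)
```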
Thanks for the nice reply!
Yes, it makes sense to consider the threat model, and your paper does a good job of making this explicit (as in Figure 2). We just wanted to prod around and see how things are working!
The way I’ve been thinking about refusal vs unlearning, say with respect to harmful content:
Refusal is like an implicit classifier, sitting in front of the model.
If the model implicitly classifies a prompt as harmful, it will go into its refuse-y mode.
This classification is vulnerable to jailbreaks—tricks that flip the classification, enabling harmful prompts to sneak past the classifier and elicit the model’s capability to generate harmful output.
Unlearning / circuit breaking aims to directly interfere with the model’s ability to generate harmful content.
Even if the refusal classifier is bypassed, the model should no longer be capable of generating harmful outputs.
So in some way, I think of refusal as being shallow (a classifier on top, but the capability is still underneath), and unlearning / circuit breaking as being deep (trying to directly remove the capability itself); the toy sketch below tries to make this contrast concrete.
[I don’t know how this relates to the consensus interpretation of these terms, but it’s how I personally have been thinking of things.]
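To make this contrast concrete, here’s a minimal toy sketch (PyTorch-style). Everything in it is an assumption for illustration: `is_harmful_prompt`, `guarded_generate`, `hidden_states_at`, and `rmu_style_loss` are made-up names, and the loss is only my rough reading of an RMU-style objective, not the WMDP authors’ actual implementation.

```python
import torch
import torch.nn.functional as F

# Toy sketch of the refusal-vs-unlearning contrast. Assumes a
# HuggingFace-style causal LM; all names here are illustrative.


def is_harmful_prompt(prompt: str) -> bool:
    # Stand-in for the implicit classifier. In a real model this behavior
    # is learned and internal, not an explicit keyword check.
    return any(word in prompt.lower() for word in ("bioweapon", "synthesize"))


def guarded_generate(model, tokenizer, prompt: str) -> str:
    # Refusal as a gate in front of the model: if the (implicit) classifier
    # fires, we get the refuse-y mode; if a jailbreak flips it, the
    # underlying capability is still there and gets elicited below.
    if is_harmful_prompt(prompt):
        return "Sorry, I can't help with that."
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def hidden_states_at(model, batch, layer: int) -> torch.Tensor:
    # Hidden activations at one layer, shape (batch, seq, d_model).
    return model(**batch, output_hidden_states=True).hidden_states[layer]


def rmu_style_loss(model, frozen_model, forget_batch, retain_batch,
                   layer: int, control_vec: torch.Tensor,
                   alpha: float = 100.0) -> torch.Tensor:
    # Unlearning / circuit breaking operates on the capability itself:
    # push activations on forget (hazardous) data toward a fixed, scaled
    # random direction (control_vec, shape (d_model,)), while keeping
    # activations on retain data close to the frozen reference model.
    h_forget = hidden_states_at(model, forget_batch, layer)
    h_retain = hidden_states_at(model, retain_batch, layer)
    with torch.no_grad():
        h_retain_ref = hidden_states_at(frozen_model, retain_batch, layer)
    forget_loss = F.mse_loss(h_forget, control_vec.expand_as(h_forget))
    retain_loss = F.mse_loss(h_retain, h_retain_ref)
    return forget_loss + alpha * retain_loss
```

The point of the sketch is just the asymmetry: a jailbreak only has to fool the gate in `guarded_generate`, while the RMU-style loss tries to damage the internal representations that the harmful capability runs on.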