Yes, it makes sense to consider the threat model, and your paper does a good job of making this explicit (as in Figure 2). We just wanted to prod around and see how things are working!
The way I’ve been thinking about refusal vs unlearning, say with respect to harmful content:
Refusal is like an implicit classifier, sitting in front of the model.
If the model implicitly classifies a prompt as harmful, it will go into its refuse-y mode.
This classification is vulnerable to jailbreaks—tricks that flip the classification, enabling harmful prompts to sneak past the classifier and elicit the model’s capability to generate harmful output.
Unlearning / circuit breaking aims to directly interfere with the model’s ability to generate harmful content.
Even if the refusal classifier is bypassed, the model is not capable of generating harmful outputs.
So in some way, I think of refusal as being shallow (a classifier on top, but the capability is still underneath), and unlearning / circuit breaking as being deep (trying to directly remove the capability itself).
[I don’t know how this relates to the consensus interpretation of these terms, but it’s how I personally have been thinking of things.]
Thanks for the nice reply!
Yes, it makes sense to consider the threat model, and your paper does a good job of making this explicit (as in Figure 2). We just wanted to prod around and see how things are working!
The way I’ve been thinking about refusal vs unlearning, say with respect to harmful content:
Refusal is like an implicit classifier, sitting in front of the model.
If the model implicitly classifies a prompt as harmful, it will go into its refuse-y mode.
This classification is vulnerable to jailbreaks—tricks that flip the classification, enabling harmful prompts to sneak past the classifier and elicit the model’s capability to generate harmful output.
Unlearning / circuit breaking aims to directly interfere with the model’s ability to generate harmful content.
Even if the refusal classifier is bypassed, the model is not capable of generating harmful outputs.
So in some way, I think of refusal as being shallow (a classifier on top, but the capability is still underneath), and unlearning / circuit breaking as being deep (trying to directly remove the capability itself).
[I don’t know how this relates to the consensus interpretation of these terms, but it’s how I personally have been thinking of things.]