Thanks so much for this investigation! Our paper focused mostly on the API-fine-tuning threat model (e.g. the OpenAI fine-tuning API), where the adversary can conduct black-box fine-tuning on the base model, but the defender can apply safety interventions like unlearning after fine-tuning. Through that lens, we only examined probing and GCG in the paper; it’s really useful that y’all are evaluating the shallowness of RMU’s robustness against a broader set of adversaries. I believe @Fabien Roger similarly demonstrated that fine-tuning on a bit of unrelated text can recover WMDP performance.
I’m unsure whether RMU should still be classified as an unlearning method, or more broadly how to draw the line between unlearning and robust refusal. Zou et al. recently expanded upon RMU for a more general set of harms and characterized their method as “circuit breaking,” and I think that framing may be more appropriate. Thanks again for these insights.