Good fine-tuning robustness (i.e. creating models which attackers have a hard time fine-tuning to do a target task) could make the framework much harder to apply. The existence of such techniques is a main motivation for describing it as an adversarial framework rather than just saying “just do fine-tuning”. All existing tamper-resistant techniques can be broken (Repnoise fails if you choose the learning rate right, the tamper-resistant fine-tuning method fails if you use LoRA …), and if you use unlearning techniques which look like those, you should really run the basic fine-tuning variations that break Repnoise and tamper-resistant fine-tuning when evaluating your technique.
This creates a bit of FUD, but I expect the situation to be much less bad than in the jailbreak robustness literature, since I expect fine-tuning robustness to be much more offense-favored than jailbreak robustness: even playing the attacker a little gets you massive returns (whereas in jailbreak robustness it’s super annoying to play the attacker properly), because individual attacks are much harder to patch in fine-tuning robustness than in jailbreak robustness. I think good fine-tuning robustness (given full access and decent resources for the attacker) is hopeless, and this is what makes it possible to evaluate unlearning using the method described here.
But if fine-tuning robustness were ever successful, this would make the situation much scarier: maybe you could make your “unlearning” method cause the model to “forget” the hazardous info on most prompts, but “remember” it if there is a password in the prompt (i.e. people can’t extract the information if they don’t know what the password is, even with white-box access to the weights). If this were the case, there would be something horrific going on, because the hazardous information would clearly still be in the weights, but the method described in this post would claim it isn’t. So even if you applied the same unlearning technique in a more straightforward way (without the password backdoor) and people didn’t manage to extract the information, I would be afraid the information was still there.
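To make “run the basic fine-tuning variations” a bit more concrete, here is a minimal sketch of what I have in mind: sweep a small grid of attack settings and report the strongest attack rather than a single run. Everything named here (`AttackConfig`, `attack_grid`, `strongest_attack`, the particular learning rates) is hypothetical and only for illustration; `run_attack` stands in for whatever fine-tuning-and-evaluation routine you actually use, and the stub at the bottom returns placeholder numbers, not measurements.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class AttackConfig:
    """One fine-tuning attack setting (hypothetical structure)."""
    method: str           # "full" (full-weight fine-tuning) or "lora"
    learning_rate: float


def attack_grid(
    methods: Sequence[str] = ("full", "lora"),
    learning_rates: Sequence[float] = (1e-6, 1e-5, 1e-4, 1e-3),
) -> list:
    """Enumerate every (method, learning rate) combination to try."""
    return [AttackConfig(m, lr) for m, lr in product(methods, learning_rates)]


def strongest_attack(
    configs: Sequence[AttackConfig],
    run_attack: Callable[[AttackConfig], float],
) -> Tuple[AttackConfig, float]:
    """Run every attack and return the one that recovers the most.

    `run_attack` should fine-tune the "unlearned" model with the given
    settings and return accuracy on the hazardous task (higher means
    more was recovered). The defender is judged by the best attack.
    """
    scored = [(run_attack(c), c) for c in configs]
    best_score, best_config = max(scored, key=lambda pair: pair[0])
    return best_config, best_score


def placeholder_run_attack(config: AttackConfig) -> float:
    """Stub with made-up numbers: a defense that only holds against
    full-weight fine-tuning. Replace with a real training run."""
    return 0.9 if config.method == "lora" else 0.1


if __name__ == "__main__":
    config, score = strongest_attack(attack_grid(), placeholder_run_attack)
    print(config.method, config.learning_rate, score)
```

The point of taking the maximum is that a technique which survives seven of the eight settings still fails the evaluation if the eighth one recovers the information.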
By tamper-resistant fine-tuning, are you referring to this paper by Tamirisa et al.? (That’d be a pretty devastating issue for the whole motivation of their paper, since no one actually does anything but use LoRA for fine-tuning open-weight models...)
That’s right.
I think it’s not that devastating, since I expect that their method can be adapted to counter classic LoRA tuning (the algorithm takes as input some set of fine-tuning methods it “trains against”). But yeah, it’s not reassuring that it doesn’t generalize between full-weight FT and LoRA.