Thanks for pointing me to Figure 12; it alleviates my concern! I don’t fully agree with RMU being a stand-in for ascent-based methods: targeted representation noising (as done in RMU) seems easier to reverse than loss-maximization methods (like TAR). Finally, just wanted to clarify that I see SSD/Potion more as automated mechanistic interpretability methods than as fine-tuning-based ones. What I meant to say was that adding some retain-set fine-tuning on top (as done for gradient routing) might be needed to make them work for tasks like unlearning virology.
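To make the contrast concrete, here is a hedged toy sketch of the two objective styles as I understand them (a simplified PyTorch stand-in, not either paper’s actual code; the model, layer choice, and coefficients are all placeholders):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, h = 16, 64
body = nn.Sequential(nn.Linear(d, h), nn.ReLU())        # layer whose activations get noised
head = nn.Linear(h, 2)
frozen_body = nn.Sequential(nn.Linear(d, h), nn.ReLU())
frozen_body.load_state_dict(body.state_dict())          # frozen copy of the original model
for p in frozen_body.parameters():
    p.requires_grad_(False)

c, alpha, beta = 10.0, 1.0, 1.0                          # placeholder coefficients
control = torch.randn(h)
control = c * control / control.norm()                   # fixed, scaled random control vector

def rmu_style_loss(x_forget, x_retain):
    # Push forget-data activations toward the random control direction (representation
    # noising), while pinning retain-data activations to the frozen model's.
    forget_term = (body(x_forget) - control).pow(2).mean()
    retain_term = (body(x_retain) - frozen_body(x_retain)).pow(2).mean()
    return forget_term + alpha * retain_term

def ascent_style_loss(x_forget, y_forget, x_retain, y_retain):
    # Loss maximization on forget data (gradient ascent), balanced against retain loss.
    ce = nn.functional.cross_entropy
    return -ce(head(body(x_forget)), y_forget) + beta * ce(head(body(x_retain)), y_retain)
```

My intuition is that the first objective mainly disrupts one layer’s representations, while the second works directly against the training signal, which is why I’d expect them to differ in reversibility.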
Ah, I see what you mean. I think my use of the term “fine-tuning” was misleading. The distinction I’m trying to draw is between interventions applied throughout training vs. after training. “Post hoc” would have been a better term to describe the latter.
My suspicion is that post hoc methods will not be sufficient to robustly remove capabilities that are strongly reinforced by the training objective (while maintaining good general performance), because the capabilities are “too deeply ingrained.”[1] We’re excited about gradient routing’s potential to solve this problem by separating capabilities during training. However, I agree that there isn’t enough evidence yet, and it would be great to do more extensive comparisons, particularly to these recent methods which also target good performance under imperfect labeling.
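For concreteness, here is a minimal toy sketch of the routing idea as I understand it (hypothetical setup: a small classifier, a hand-picked “forget region” of one layer, and per-batch forget/retain labels; not our actual code, and how the rest of the network is treated is itself a design choice):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 16
model = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

# Designate the first 32 hidden units of the first layer as the "forget region".
forget_region = torch.zeros(64, dtype=torch.bool)
forget_region[:32] = True

def train_step(x, y, is_forget: bool):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    w, b = model[0].weight, model[0].bias
    if is_forget:
        # Route forget-data gradients into the forget region only.
        w.grad[~forget_region] = 0.0
        b.grad[~forget_region] = 0.0
    else:
        # Retain-data gradients never update the forget region, so retain
        # performance should not come to depend on it.
        w.grad[forget_region] = 0.0
        b.grad[forget_region] = 0.0
    opt.step()
    return loss.item()
```

After training, the forget region can be ablated (zeroed out); the sketch continues below.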
For what it’s worth, I don’t think fine-tuning is doing that much work for us: we see it as a light-touch correction to the “internal distribution shift” caused by ablation. As mentioned in this comment, we find that post-ablation fine-tuning on the retain set helps both retain and forget set performance. In the same comment we also show that retraining on the training distribution (a mixture of forget and retain data) produces qualitatively similar results.
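Continuing the toy sketch above, the ablate-then-correct step might look like this (again hypothetical; retain_loader and the small learning rate are placeholders):

```python
# Ablate the forget region, then apply a light-touch retain fine-tune to correct the
# internal distribution shift that the ablation introduces downstream.
with torch.no_grad():
    model[0].weight[forget_region] = 0.0
    model[0].bias[forget_region] = 0.0

opt = torch.optim.SGD(model.parameters(), lr=1e-3)  # small LR: a correction, not retraining
for x, y in retain_loader:  # hypothetical retain-only loader; a forget/retain mixture works similarly
    opt.zero_grad()
    nn.functional.cross_entropy(model(x), y).backward()
    # Keep the ablated region pinned at zero so the correction cannot rebuild it.
    model[0].weight.grad[forget_region] = 0.0
    model[0].bias.grad[forget_region] = 0.0
    opt.step()
```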
Also, if the goal is to be robust not only to imperfect labeling but also to forget set retraining, then post hoc methods face a fundamental challenge: the minimal changes to a model that degrade performance on a task are potentially quite different from the minimal changes that prevent the task from being quickly relearned.
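A hedged sketch of what that robustness check looks like in practice: fine-tune a copy of the unlearned model on forget data and track how quickly the capability comes back (all names here are placeholders, not an established benchmark’s API):

```python
import copy
import itertools
import torch
import torch.nn as nn

def relearning_curve(unlearned_model, forget_loader, eval_forget_acc, steps=50, lr=1e-4):
    """Fine-tune a copy of the unlearned model on forget data and record forget-task
    accuracy after each step. A flat curve suggests the capability was removed;
    a steep one suggests it was merely suppressed."""
    attacked = copy.deepcopy(unlearned_model)
    opt = torch.optim.Adam(attacked.parameters(), lr=lr)
    batches = itertools.cycle(forget_loader)
    curve = [eval_forget_acc(attacked)]  # performance before any retraining
    for _ in range(steps):
        x, y = next(batches)
        opt.zero_grad()
        nn.functional.cross_entropy(attacked(x), y).backward()
        opt.step()
        curve.append(eval_forget_acc(attacked))
    return curve
```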
That makes sense. My higher-level concern with gradient routing (and, to some extent, with any safety method applied throughout training rather than after it) is the alignment tax: it might lead to significantly lower performance and therefore not get adopted in frontier models.
Evidence of this for gradient routing: people have tried various forms of modular training before [1], [2], and they never really caught on because it’s always better to train a combined model that allows optimal sharing of parameters.
It’s still a cool idea though, and I would be happy to see it work out :)
[1] Andreas, Jacob, et al. “Neural Module Networks.” CVPR 2016.
[2] Ebrahimi, Sayna, et al. “Adversarial Continual Learning.” ECCV 2020.