Ah, I see what you mean. I think my use of the term “fine-tuning” was misleading. The distinction I’m trying to draw is between interventions applied throughout training vs. after training. “Post hoc” would have been a better term to describe the latter.
My suspicion is that post hoc methods will not be sufficient to robustly remove capabilities that are strongly reinforced by the training objective (while maintaining good general performance), because the capabilities are “too deeply ingrained.”[1] We’re excited about gradient routing’s potential to solve this problem by separating capabilities during training. However, I agree that there isn’t enough evidence yet, and it would be great to do more extensive comparisons, particularly to these recent methods which also target good performance under imperfect labeling.
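To make that concrete, here's a minimal sketch of the routing idea on a toy MLP (not our actual implementation): forget-labeled batches only send gradients into a reserved slice of hidden units, so the unwanted capability is encouraged to localize there during training. The names (`RoutedMLP`, `FORGET_DIMS`) and the exact masking scheme are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn

HIDDEN = 64
FORGET_DIMS = slice(0, 16)  # hidden units reserved for the forget capability

class RoutedMLP(nn.Module):
    def __init__(self, d_in=32, d_out=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, HIDDEN)
        self.fc2 = nn.Linear(HIDDEN, d_out)

    def forward(self, x, is_forget: bool):
        h = torch.relu(self.fc1(x))
        mask = torch.zeros_like(h)
        if is_forget:
            mask[:, FORGET_DIMS] = 1.0  # forget data only updates the reserved dims
        else:
            mask[:, :] = 1.0
            mask[:, FORGET_DIMS] = 0.0  # retain data updates the complement
        # Forward pass is unchanged; the backward pass only flows through the
        # unmasked entries (the detached term blocks gradients elsewhere).
        h = h * mask + (h * (1.0 - mask)).detach()
        return self.fc2(h)
```

At the end of training, the reserved slice can be ablated (zeroed out), which is where the post-ablation fine-tuning discussed below comes in.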
For what it’s worth, I don’t think fine-tuning is doing that much work for us: we see it as a light-touch correction to the “internal distribution shift” caused by ablation. As mentioned in this comment, we find that post-ablation fine-tuning on the retain set helps both retain and forget set performance. In the same comment, we also show that retraining on the training distribution (a mixture of forget and retain data) produces qualitatively similar results.
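Concretely, that correction might look something like the following sketch, reusing the hypothetical `RoutedMLP` and `FORGET_DIMS` from above and assuming a `retain_loader` that yields `(inputs, labels)` batches:

```python
import torch

def ablate_and_finetune(model, retain_loader, steps=100, lr=1e-4):
    # Ablate: zero the parameters feeding into and out of the reserved dims.
    with torch.no_grad():
        model.fc1.weight[FORGET_DIMS, :] = 0.0
        model.fc1.bias[FORGET_DIMS] = 0.0
        model.fc2.weight[:, FORGET_DIMS] = 0.0

    # Light-touch fine-tuning on retain data to correct the internal
    # distribution shift caused by the ablation. Routing with
    # is_forget=False also keeps the ablated dims from being relearned.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _, (x, y) in zip(range(steps), retain_loader):
        opt.zero_grad()
        loss = loss_fn(model(x, is_forget=False), y)
        loss.backward()
        opt.step()
    return model
```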
[1] Also, if the goal is to be robust not only to imperfect labeling but also to forget set retraining, then post hoc methods face a fundamental challenge: the minimal changes to a model that induce bad performance on a task are potentially quite different from the minimal changes that prevent retrainability.
Thanks for the question!
Yeah, the story is something like: structuring model internals gives us more control over how models generalize from limited supervision. For example, maybe we can factor out how a model represents humans vs. how it represents math concepts, then localize RLHF updates on math research to the math-concept region. This kind of learning update would plausibly reduce the extent to which a model learns (or learns to exploit) human biases, increasing the odds that the model generalizes in an intended way from misspecified feedback.
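A hedged sketch of what “localizing” an update could look like mechanically: given a precomputed 0/1 mask over parameters marking the math-concept region (which I’m assuming comes from something like gradient routing; mask construction isn’t shown), the RLHF gradient step is simply masked before the optimizer applies it. The names (`param_masks`, `masked_step`) are hypothetical.

```python
import torch

def masked_step(model, loss, param_masks, optimizer):
    """Apply a gradient update restricted to a designated parameter region.

    param_masks: dict mapping parameter names to 0/1 tensors of the same
    shape (1.0 inside the targeted region, 0.0 elsewhere).
    """
    optimizer.zero_grad()
    loss.backward()
    for name, p in model.named_parameters():
        if p.grad is not None and name in param_masks:
            p.grad.mul_(param_masks[name])  # zero out updates outside the region
    optimizer.step()
```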
Another angle is: if we create models with selective incapacities (e.g. lack of situational awareness), the models might lack the concepts required to misgeneralize from our feedback. For example, suppose a situationally unaware model explores a trajectory in which it subversively manipulates its environment and receives higher-than-average reward; as a result, the model will be updated towards that behavior. However, since the model lacks the concepts required to internalize the behavioral tendency “gain control over my environment,” it won’t learn that tendency. Instead, the trajectory might simply serve as noise.