Aside: it seems unlikely that the method they use actually does something well described as “unlearning”. It does seem to do something useful based on their results, just not something that corresponds to the intuitive meaning of “unlearning”.
(From my understanding, the way this paper uses “unlearning” is similar to how the term is used in the literature, but I think this is a highly misleading jargonization. I wish people would find a different term.)
As far as why I don’t think this method does “unlearning”, my main source of evidence is that a small amount of finetuning seems to nearly fully restore performance on the validation/test set (see Appendix B.5 and Figure 14), which implies that the knowledge remains essentially fully present in the weights. In other words, I think this finetuning is just teaching the model the propensity to “try” to answer correctly rather than actually removing this knowledge from the weights.
(Note that this finetuning likely isn’t just “relearning” the knowledge from training on a small number of examples as it’s very likely that different knowledge is required for most of the multiple choice questions in this dataset. I also predict that if you train on a tiny number of examples (e.g. 64) for a large number of epochs (e.g. 12), this will get the model to answer correctly most of the time.)
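The “not just relearning” argument above rests on the finetuning set and the evaluation set requiring disjoint knowledge. A minimal sketch of that probe’s data split (the function name and sizes here are mine, not the paper’s; real use would finetune on the first set and measure accuracy on the second):

```python
import random

def make_probe_split(question_ids, n_finetune=64, seed=0):
    """Split question ids into a tiny finetuning set and a disjoint
    held-out evaluation set (illustrative; not the paper's code)."""
    rng = random.Random(seed)
    ids = list(question_ids)
    rng.shuffle(ids)
    return ids[:n_finetune], ids[n_finetune:]

finetune_ids, eval_ids = make_probe_split(range(1000))
# Because the two sets share no questions, accuracy recovered on
# eval_ids can't come from relearning the finetuning examples
# themselves -- it has to come from knowledge still in the weights.
assert not set(finetune_ids) & set(eval_ids)
```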
That said, my understanding is that no other “unlearning” methods in the literature actually do something well described as unlearning.
This method does seem to make the model much less likely to state this knowledge, given their results with the GCG adversarial attack method. (That said, they don’t compare to well-known methods like adversarial training. Given that their method doesn’t actually remove the knowledge based on the finetuning results, I think it’s somewhat sad they don’t do this comparison. My overall guess is that this method improves the helpfulness/harmlessness Pareto frontier[1] somewhat, but it’s by no means obvious, especially if we allow for dataset filtering.)
Overall, it seems pretty sad to me that this is called “unlearning” given that the results seem to show it doesn’t work well for the most natural meaning of “unlearning” (the model no longer has the knowledge in the weights, so e.g. finetuning it to express the knowledge doesn’t work). I think this bad terminology is imported from the existing literature (it’s not as though any of the other “unlearning” methods actually do the natural meaning of unlearning), but I still think it seems pretty sad.
Similarly, it would seem odd to call it “unlearning” if you used adversarial training to get models to not state bioweapons knowledge, but as far as I can tell the key results in the paper (jailbreak resistance, and finetuning restoring dangerous performance) are the results you’d expect from adversarial training. (That said, the lack of linearly available information is different, though the L2 regularization method does somewhat naturally achieve this.)
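For readers unfamiliar with the “linearly available information” point: it is typically measured with a linear probe trained on hidden activations. A generic sketch on synthetic “activations” (not the paper’s setup; the data and probe here are entirely illustrative):

```python
import numpy as np

# Synthetic stand-ins for hidden activations and a binary label that is
# linearly decodable from them.
rng = np.random.default_rng(0)
n, d = 2000, 32
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)  # linearly available signal

# Train a logistic-regression probe with plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / n

acc = ((X @ w > 0) == y).mean()
# High probe accuracy means the information is linearly available; a
# method that truly removed it should drive accuracy toward chance (0.5).
```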
[1] See Figure 2 here for more information on what I mean by “the helpfulness/harmlessness Pareto frontier”.