Is deleting capabilities still a relevant research question?

It has been suggested to me that a good criterion for whether interpretability is on the right track is whether we can perform surgical “deletions” of model capabilities, e.g. removing a model’s ability to help with building bombs and the like.

In one sense this is fairly trivial, since you can just use simple gradient descent to make the model refuse. The issue is that, given the weights, people can easily undo these refusals, and adversarial prompting can often bypass them.

I know there’s been some back and forth on methods for full deletion, and I’m wondering if it’s considered a solved problem or not.

tailcalled 21 May 2024 13:24 UTC

15 points

mic 21 May 2024 14:18 UTC
13 points
I think unlearning model capabilities is definitely not a solved problem! See “Eight Methods to Evaluate Robust Unlearning in LLMs” and “Rethinking Machine Unlearning for Large Language Models”, as well as the limitations sections of more recent papers like the WMDP Benchmark and SOPHON.
