[Question] Is deleting capabilities still a relevant research question?
It’s been suggested to me that a good test of whether interpretability is on the right track is whether we can surgically “delete” model capabilities, e.g. remove a model’s ability to explain how to build bombs and the like.
In one sense this is fairly trivial, since you can just use simple gradient descent to train the model to refuse. The issue is that anyone with the weights can easily undo these refusals, and adversarial prompting can often bypass them as well.
I know there’s been some back and forth on methods for full deletion, and I’m wondering whether it’s now considered a solved problem.