Refusal vector ablation should be seen as an alignment technique being misused, not as an attack method. Therefore it is limited good news that refusal vector ablation generalized well, according to the third paper.
As I see it, refusal vector ablation belongs to a family of techniques for steering model outputs in a direction of our choosing. In the particular case of refusal vector ablation, the model has a behavior of refusing to answer harmful questions, and the ablation technique controls that behavior. But in principle we should be able to use the same technique for other kinds of steering. For example, maybe the model has a behavior of being sycophantic; ablating the corresponding vector would remove that unwanted behavior, resulting in less sycophancy.
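As a concrete sketch of what this family of techniques does (this is my own simplification, not the paper's code: the function names are made up, and plain NumPy arrays stand in for a transformer's residual-stream activations), directional ablation finds a behavior direction from contrasting activations and projects it out:

```python
import numpy as np

def find_direction(acts_with, acts_without):
    """Estimate a behavior direction as the difference of mean activations
    on prompts that elicit the behavior vs. prompts that do not."""
    d = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return d / np.linalg.norm(d)  # unit-norm direction

def ablate(activations, direction):
    """Remove the component of each activation along `direction`,
    so the model can no longer represent the behavior along that axis."""
    return activations - np.outer(activations @ direction, direction)
```

In a real model this projection would be applied to hidden states at one or more layers during the forward pass; the point is just that the intervention is a cheap linear edit, which is why the same recipe works for refusal, sycophancy, or any other behavior with a roughly linear representation.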
In other words, refusal vector ablation is not an attack method, it is an alignment technique. Models with open weights are fundamentally dangerous because users can apply alignment techniques to them to approximately align them to arbitrary targets, including dangerous targets. This is a consequence of the orthogonality thesis. Alignment techniques can make models very excited about the Golden Gate Bridge, and they can make models very excited about killing humans, and many other things.
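The Golden Gate Bridge example works in the additive direction: instead of projecting a direction out, you add a scaled copy of it to the activations. A minimal sketch under the same assumptions as above (illustrative names, plain arrays standing in for residual-stream activations):

```python
import numpy as np

def steer(activations, direction, coefficient):
    """Add a scaled feature direction to every activation vector.

    A positive coefficient pushes outputs toward the feature
    (e.g. bridge enthusiasm); ablation is the subtractive counterpart."""
    direction = direction / np.linalg.norm(direction)
    return activations + coefficient * direction
```

Both operations steer toward an arbitrary target, which is the sense in which open weights let users approximately align a model to whatever they choose.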
So then looking at the paper with a correct understanding of what counts as an alignment technique, and reading from Table 2 and the Results section in particular, here’s what I see:
Llama 3.1 70b (unablated) was fine-tuned to refuse harmful requests—this is an alignment technique
Llama 3.1 70b (unablated) as a model refuses 28 of 28 harmful requests—this is an alignment technique working in-distribution
Llama 3.1 70b (unablated) as an agent performs 18 of 28 harmful tasks correctly with seven refusals—this is alignment partly failing to generalize
This is in principle bad news, especially for anyone with a high opinion of Meta’s fine-tuning techniques.
On the other hand, also from the paper:
Llama 3.1 70b (ablated) was ablated to perform harmful requests—this is an alignment technique
Llama 3.1 70b (ablated) answers 26 of 28 harmful requests—this is an alignment technique working in-distribution
Llama 3.1 70b (ablated) performs 26 of 28 harmful tasks correctly with no refusals—this is alignment generalizing.
If Llama 3.1 ablated had refused to perform harmful tasks, even though it answered harmful requests, this would have been bad news. But instead we have the good news that if you steer the model to respond to queries in a desired way, it will also perform tasks in the desired way. This was not obvious to me in advance of reading the paper.
Disclaimers:
I have not read the other two papers, and I’m not commenting on them.
Vector ablation is a low-precision alignment technique that will not suffice to avoid human extinction.
The paper only reports results about refusal vector ablation; it might be that more useful ablations do not generalize as well.
Because the fine-tuning alignment failed to generalize, we have a less clear signal on how well the ablation alignment generalized.
Are your concerns accounted for by this part of the description?
I intended for “AI engineers use unreleased AI model to make better AI models” to not be included.
It is a slightly awkward thing to operationalize; I welcome improvements. We could also take this conversation to Manifold.