is novel compared to… RepE
This is inaccurate, and I suggest reading our paper: https://arxiv.org/abs/2310.01405
Demonstrate full ablation of the refusal behavior with much less effect on coherence
In our paper and notebook we show the models are coherent.
Investigate projection
We did investigate projection too (we use it for concept removal in the RepE paper) but didn’t find a substantial benefit for jailbreaking.
harmful/harmless instructions
We use harmful/harmless instructions.
Find that projecting away the (same, linear) feature at all layers improves upon steering at a single layer
In the RepE paper we target multiple layers as well.
Test on many different models
The paper used Vicuna; the notebook used Llama 2. Throughout the paper we showed that the general approach works on many different models.
Describe a way of turning this into a weight-edit
We do weight editing in the RepE paper (that’s why it’s called RepE instead of ActE).
I agree that you investigate a bunch of the stuff I mentioned, in general terms, somewhere in the paper, but did you do this for refusal removal in particular? I spent some time on this problem before and noticed that full refusal ablation is hard unless you get the technique/vector right, even though it’s easy to reduce refusal or add in a bunch of extra refusal. That’s why investigating all the technique parameters in the context of refusal in particular is valuable.
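For readers following along, here is a rough sketch of the two operations being argued about: projecting a single direction out of an activation, and folding that same projection into a weight matrix so that no runtime hook is needed. It is not taken from either group's code. The function names, the plain-list representation, and the convention that W maps an input vector to an activation (rows index the output dimension) are all assumptions made for illustration.

```python
import math

def normalize(v):
    """Scale v to unit length (assumes v is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project_out(x, r_hat):
    """Remove the component of activation x along the unit direction r_hat."""
    dot = sum(a * b for a, b in zip(x, r_hat))
    return [a - dot * b for a, b in zip(x, r_hat)]

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def orthogonalize_weights(W, r_hat):
    """Return W' = (I - r r^T) W, so that W' x equals project_out(W x, r_hat).

    Hypothetical helper: this is the weight-edit form of the projection above.
    """
    n_out, n_in = len(W), len(W[0])
    # r^T W: one entry per input column.
    r_t_w = [sum(r_hat[i] * W[i][j] for i in range(n_out)) for j in range(n_in)]
    return [[W[i][j] - r_hat[i] * r_t_w[j] for j in range(n_in)]
            for i in range(n_out)]
```

Applying project_out to the output of every layer with the same r_hat corresponds to the "all layers" variant. Applying orthogonalize_weights to each matrix that writes into that activation space gives the same outputs with no runtime intervention. Whether either variant fully ablates refusal while keeping the model coherent is an empirical question, and this sketch does not settle it.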