We do weight editing in the RepE paper (that’s why it’s called RepE instead of ActE).
I looked at the paper again and couldn’t find anywhere that you do the type of weight editing this post describes (extracting a representation and then changing the weights, without any optimization, so that they can no longer write to that direction); a rough sketch of what I mean is below.
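To make that concrete, here is a minimal sketch of the kind of optimization-free weight edit I have in mind, assuming a unit-norm refusal direction `r_hat` has already been extracted from activations (the names, shapes, and weight convention here are illustrative, not taken from either the paper or the post's code):

```python
import torch

def orthogonalize_weights(W_out: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Edit a weight matrix so it can no longer write to direction r_hat.

    W_out: a matrix that writes into the residual stream, stored as
           (d_model, d_in) so that its output is W_out @ x.
    r_hat: unit-norm direction in the residual stream, shape (d_model,).
    """
    # W' = W - r_hat (r_hat^T W): every output of W' has zero component along r_hat.
    return W_out - torch.outer(r_hat, r_hat @ W_out)

# Applied once, with no gradient steps, to each matrix that writes to the
# residual stream (e.g. embedding, attention-output, and MLP-output projections).
```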
The LoRRA approach mentioned in RepE fine-tunes the model to change its representations, which is different.
I agree the paper investigates much of what I mentioned in some general setting, but did you do this for refusal removal in particular? I spent some time on this problem before and found that fully ablating refusal is hard unless you get the technique and vector right, even though it’s easy to reduce refusal or to add a lot of extra refusal. That’s why investigating all of the technique parameters in the refusal setting specifically is valuable.
This is inaccurate, and I suggest reading our paper: https://arxiv.org/abs/2310.01405
In our paper and notebook we show the models are coherent.
We did investigate projection too (we use it for concept removal in the RepE paper) but didn’t find a substantial benefit for jailbreaking.
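For readers comparing the two interventions, a rough sketch of projection (removing the component along a direction) versus activation addition; the names and shapes are illustrative, not code from the paper:

```python
import torch

def project_out(acts: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Remove the component of activations along a unit-norm direction r_hat.

    acts: activations of shape (..., d_model); r_hat: shape (d_model,).
    """
    coeff = acts @ r_hat                        # component along r_hat, shape (...,)
    return acts - coeff.unsqueeze(-1) * r_hat   # activations with that component removed

def add_direction(acts: torch.Tensor, r_hat: torch.Tensor, alpha: float) -> torch.Tensor:
    """Activation addition: steer by adding alpha * r_hat to the activations."""
    return acts + alpha * r_hat
```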
We use harmful/harmless instructions.
In the RepE paper we target multiple layers as well.
The paper used Vicuna; the notebook used Llama 2. Throughout the paper we showed that the general approach works on many different models.