FWIW, I published this Alignment Forum post on activation steering to bypass refusal (albeit an early variant that reduces coherence too much to be useful), which, from what I can tell, is the earliest work on linear residual-stream perturbations to modulate refusal in RLHF'd LLMs.
I think this post is novel compared to both my work and RepE, because the authors:
Demonstrate full ablation of the refusal behavior with much less effect on coherence / other capabilities compared to normal steering
Investigate projection thoroughly as an alternative to sweeping over vector magnitudes (rather than just stating that this is possible)
Find that using harmful/harmless instructions (rather than harmful vs. harmless/refusal responses) to generate a contrast vector is the most effective (whereas other works try one or the other), and also investigate which token position to extract the representation from
Find that projecting away the (same, linear) feature at all layers improves upon steering at a single layer, which is different from standard activation steering
Test on many different models
Describe a way of turning this into a weight-edit
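For concreteness, the recipe summarized in the list above (a difference-in-means contrast vector from harmful vs. harmless instructions, then projecting that direction out of the residual stream at every layer) can be sketched roughly as follows. This is a minimal NumPy sketch of the general technique, not code from the post; the function names and synthetic shapes are my own:

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Difference-in-means contrast vector between residual-stream
    activations on harmful vs. harmless instructions, unit-normalized."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(X, r_hat):
    """Project the r_hat component out of each activation row of X,
    so the residual stream no longer carries that feature. In practice
    this would be applied at every layer, not just one."""
    return X - np.outer(X @ r_hat, r_hat)

# Toy demonstration on synthetic activations (d_model = 16).
rng = np.random.default_rng(0)
harmful = rng.normal(size=(32, 16)) + 3.0 * np.eye(16)[0]  # shifted along dim 0
harmless = rng.normal(size=(32, 16))
r_hat = refusal_direction(harmful, harmless)
acts = rng.normal(size=(8, 16))
ablated = ablate_direction(acts, r_hat)
```

Unlike adding a steering vector with a tuned magnitude, the projection needs no coefficient sweep: after ablation, every activation has exactly zero component along the extracted direction.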
Edit:
(Want to flag that I strong-disagree-voted with your comment, and am not in the research group—it is not them “dogpiling”)
I do agree that RepE should be included in a “related work” section of a paper but generally people should be free to post research updates on LW/AF that don’t have a complete thorough lit review / related work section. There are really very many activation-steering-esque papers/blogposts now, including refusal-bypassing-related ones, that all came out around the same time.
but generally people should be free to post research updates on LW/AF that don’t have a complete thorough lit review / related work section.
I agree, if they simultaneously accept that they shouldn't expect the post to be cited. Such posts can't posture as academic artifacts (a “Citing this work” section indicates that's the expectation) while failing to mention related work. I don't think you should expect people to treat your post as related work if you don't cover related work yourself.
Otherwise there's a race to the bottom, and it makes sense to post daily research notes and flag-plant that way, which increases the pressure on researchers further.
including refusal-bypassing-related ones
The prior work covered in the document is generally less related (fine-tuning to remove safeguards, truth directions) than these directly relevant works. This is an unusual citation pattern, and it gives the impression that the artifact makes more progress / advances understanding more than it actually does.
I'll note that pretty much every time I mention that something on LW isn't following academic standards, I get ganged up on, which I find pretty weird. I've reviewed for and organized ML conferences and could serve as a senior area chair, so I know the standards well. Perhaps this reaction is so consistent because academic standards feel like an outside community imposing its norms on LW.
We do weight editing in the RepE paper (that’s why it’s called RepE instead of ActE)
I looked at the paper again and couldn't find anywhere you do the type of weight editing this post describes (extracting a representation and then changing the weights, without optimization, so that they can no longer write to that direction).
The LoRRA approach mentioned in RepE fine-tunes the model to change its representations, which is different.
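The optimization-free weight edit being described (orthogonalizing every matrix that writes to the residual stream against the extracted direction) can be sketched as a rank-one update. This is my illustration of the general idea, not code from either work; it assumes W maps into the residual stream, i.e. its output lives in model space:

```python
import numpy as np

def orthogonalize_weights(W, r_hat):
    """Rank-one, optimization-free edit of an output weight matrix W
    (shape d_model x d_in, writing into the residual stream):
    W' = (I - r_hat r_hat^T) W, so W' can never write along r_hat."""
    return W - np.outer(r_hat, r_hat) @ W

# Toy check: after the edit, the layer's output has no r_hat component.
rng = np.random.default_rng(1)
r_hat = rng.normal(size=8)
r_hat = r_hat / np.linalg.norm(r_hat)
W = rng.normal(size=(8, 4))
W_edited = orthogonalize_weights(W, r_hat)
out = W_edited @ rng.normal(size=4)
```

Because the edit is a closed-form projection of the weights themselves, it bakes the ablation into the model with no fine-tuning, which is what distinguishes it from a LoRRA-style optimization approach.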
I agree you investigate much of what I mentioned somewhere in the paper, but did you do this for refusal removal in particular? I spent some time on this problem before and noticed that full refusal ablation is hard unless you get the technique/vector right, even though it's easy to reduce refusal or to add a bunch of extra refusal. That's why investigating all the technique parameters in the context of refusal specifically is valuable.
This is inaccurate, and I suggest reading our paper: https://arxiv.org/abs/2310.01405
In our paper and notebook we show the models are coherent.
We did investigate projection too (we use it for concept removal in the RepE paper) but didn’t find a substantial benefit for jailbreaking.
We use harmful/harmless instructions.
In the RepE paper we target multiple layers as well.
The paper used Vicuna, and the notebook used Llama 2; throughout the paper we showed that the general approach works on many different models.