We do weight editing in the RepE paper (that’s why it’s called RepE instead of ActE).
I looked at the paper again and couldn’t find anywhere that you do the type of weight editing this post describes (extracting a representation and then changing the weights, without any optimization, so that they can no longer write to that direction); a rough sketch of what I mean is below.
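To make that concrete, here is a minimal sketch of the kind of optimization-free weight edit I have in mind, assuming a unit-norm refusal direction `r_hat` has already been extracted from activations (the names, shapes, and weight convention here are illustrative, not taken from either the paper or the post's code):

```python
import torch

def orthogonalize_weights(W_out: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Edit a weight matrix so it can no longer write to direction r_hat.

    W_out: a matrix that writes into the residual stream, stored as
           (d_model, d_in) so that its output is W_out @ x.
    r_hat: unit-norm direction in the residual stream, shape (d_model,).
    """
    # W' = W - r_hat (r_hat^T W): every output of W' has zero component along r_hat.
    return W_out - torch.outer(r_hat, r_hat @ W_out)

# Applied once, with no gradient steps, to each matrix that writes to the
# residual stream (e.g. embedding, attention-output, and MLP-output projections).
```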
The LoRRA approach mentioned in RepE fine-tunes the model to change its representations, which is different.
I agree the paper investigates much of what I mentioned in some general setting, but did you do this for refusal removal in particular? I spent some time on this problem before and found that fully ablating refusal is hard unless you get the technique and vector right, even though it’s easy to reduce refusal or to add a lot of extra refusal. That’s why investigating all of the technique parameters in the refusal setting specifically is valuable.
This is inaccurate, and I suggest reading our paper: https://arxiv.org/abs/2310.01405
In our paper and notebook we show the models are coherent.
We did investigate projection too (we use it for concept removal in the RepE paper) but didn’t find a substantial benefit for jailbreaking.
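For readers comparing the two interventions, a rough sketch of projection (removing the component along a direction) versus activation addition; the names and shapes are illustrative, not code from the paper:

```python
import torch

def project_out(acts: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Remove the component of activations along a unit-norm direction r_hat.

    acts: activations of shape (..., d_model); r_hat: shape (d_model,).
    """
    coeff = acts @ r_hat                        # component along r_hat, shape (...,)
    return acts - coeff.unsqueeze(-1) * r_hat   # activations with that component removed

def add_direction(acts: torch.Tensor, r_hat: torch.Tensor, alpha: float) -> torch.Tensor:
    """Activation addition: steer by adding alpha * r_hat to the activations."""
    return acts + alpha * r_hat
```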
We use harmful/harmless instructions.
In the RepE paper we target multiple layers as well.
The paper used Vicuna; the notebook used Llama 2. Throughout the paper we showed that the general approach works on many different models.