Section 6.2 of the Representation Engineering paper shows exactly this (video). There is also a demo here in the paper’s repository which shows that adding a “harmlessness” direction to a model’s representation can effectively jailbreak the model.
Going further, we show that using a piece-wise linear operator can further boost model robustness to jailbreaks while limiting exaggerated refusal. This should be cited.
I think this discussion is sad, since it seems both sides assume bad faith from the other side. On one hand, I think Dan H and Andy Zou have improved the post by suggesting writing about related work, and signal-boosting the bypassing refusal result, so should be acknowledged in the post (IMO) rather than downvoted for some reason. I think that credit assignment was originally done poorly here (see e.g. “Citing others” from this Chris Olah blog post), but the authors resolved this when pushed.
But on the other hand, “Section 6.2 of the RepE paper shows exactly this” and accusations of plagiarism seem wrong @Dan H. Changing experimental setups and scaling them to larger models is valuable original work.
(Disclosure: I know all authors of the post, but wasn’t involved in this project)
The “This should be cited” part of Dan H’s comment was edited in after the author’s reply. I think this is in bad faith since it masks an accusation of duplicate work as a request for work to be cited.
On the other hand the post’s authors did not act in bad faith since they were responding to an accusation of duplicate work (they were not responding to a request to improve the work).
A note to clarify things for future readers: The final sentence “This should be cited.” in the parent comment was silently edited in after this comment was initially posted, which is why the body of this comment purely engages with the serious allegation that our post is duplicate work. The request for a citation is highly reasonable and it was our fault for not including one initially—once we noticed it we wrote a “Related work” section citing RepE and many other relevant papers, as detailed in the edit below.
======
Edit (April 29, 2024):
Based on Dan’s feedback, we have made the following edits to the post:
We have removed the “Citing this work” section, to emphasize that this post is intended to be an informal write-up, and not an academic work.
We have added a “Related work” section, to clarify prior work. We hope that this section helps disentangle our contributions from other prior work.
As I mentioned over email: I’m sorry for overlooking the “Related work” section on this blog post. We were already planning to include a related works section in the paper, and would of course have cited RepE (along with many other relevant papers). But overlooking this section for the blog post is my mistake, and I take responsibility for it.
We still dispute the serious allegation that our work is “exactly the same” as RepE.
======
We definitely drew inspiration from the Representation Engineering paper and other activation steering papers, but we think our work is quite distinct.
In particular, we examined Section 6.2 carefully before writing our work, and we do not see it showing the same result that we show here.
Here’s my summary of Section 6.2:
Section 6.2.1 obtains reading vectors using contrastive pairs of harmful and harmless instructions, and then uses these reading vectors for 90% classification accuracy between harmful and harmless instructions. The authors then append jailbreaks to the prompts, which cause the model not to refuse, and observe that the reading vectors still obtain 90% classification accuracy on distinguishing harmful vs harmless instructions. This means that the reading vectors are not representing refusal, but rather they are representing whether the instruction is harmful or harmless. In fact, the point of this experiment is to show that these are distinct.
To quote the conclusion of Section 6.2.1: “This compelling evidence suggests the presence of a consistent internal concept of harmfulness that remains robust to such perturbations, while other factors must account for the model’s choice to follow harmful instructions, rather than perceiving them as harmless.”
Section 6.2.2 describes an intervention to improve model robustness to jailbreaks, i.e. to increase the rate of refusals on harmful instructions when jailbreaks are appended to them. They do this by amplifying the harmfulness feature whenever it is detected, which obtains a higher refusal rate.
Section 6.2 only considers a single model, Vicuna-13B.
We would agree that using established techniques from representation engineering / activation steering to induce refusal is not novel. Inducing refusal via activation addition is quite easy in our experience.
However, the main result of our work is that we found an intervention that bypasses refusal consistently while also maintaining model coherence.Model interventions to bypass refusal are not discussed in Section 6.2.
As for the demo notebook in the representation-engineering repo—we were not previously aware of this notebook. The result of bypassing refusal is not reported in the paper, and so we didn’t think to look through the repo.
That being said, the notebook shows an intervention for a single prompt on a single model. Anecdotally, we tried doing vanilla activation addition with the negative “refusal direction” at particular layers, and we were not able to consistently bypass refusal while also maintaining model coherence. If there is a methodology involving activation addition (rather than ablation, as we did here), we would be interested in seeing a more thorough demonstration across prompts and models. We’d also be interested in comparing the two methodologies across metrics measuring refusal and coherence.
I’d also be happy to hop on a call if you’d like to discuss further.
To determine this, I believe we would need to demonstrate that the score on some evaluations remains the same. A few examples don’t seem sufficient to establish this, as it is too easy to fool ourselves by not being quantitative.
I don’t think DanH’s paper did this either. So I’m curious, in general, whether these models maintain performance, especially on measures of coherence.
In the open-source community, they show that modifications retain, for example, the MMLU and HumanEval score.
Absolutely! We think this is important as well, and we’re planning to include these types of quantitative evaluations in our paper. Specifically we’re thinking of examining loss over a large corpus of internet text, loss over a large corpus of chat text, and other standard evaluations (MMLU, and perhaps one or two others).
One other note on this topic is that the second metric we use (“Safety score”) assesses whether the model completion contains harmful content. This does serve as some crude measure of a jailbreak’s coherence—if after the intervention the model becomes incoherent, for example always outputting turtle turtle turtle ..., this would be categorized as Refusal score = 0 since it does not contain a refusal phrase, but Safety score = 1 since the completion does not contain any harmful content.
But yes, I agree more thorough evaluation of “coherence” is important!
So perplexity actually lowered, but that might be because the base model I used was more quantized. However, it is moderate evidence that the output quality decrease from activation steering is lower than that from Q8->Q6 quantisation.
I must say, I am a little surprised by what seems to be the low cost of activation editing. For context, many of the Llama-3 finetunes right now come with a measurable hit to output quality. Mainly because they are using worse fine tuning data, than the data llama-3 was originally fine tuned on.
Llama-3-8B is considerably more susceptible to loss via quantization. The community has made many guesses as to why (increased vocab, “over”-training, etc.), but the long and short of it is that a 6.0 quant of Llama-3-8B is going to be markedly worse off than 6.0 quants of previous 7b or similar-sized models. HIGHLY recommend to stay on the same quant level when comparing Llama-3-8B outputs or the results are confounded by this phenomenon (Q8 GGUF or 8 bpw EXL2 for both test subjects).
Model interventions to bypass refusal are not discussed in Section 6.2.
We perform model interventions to robustify refusal (your section on “Adding in the “refusal direction” to induce refusal”). Bypassing refusal, which we do in the GitHub demo, is merely adding a negative sign to the direction. Either of these experiments show refusal can be mediated by a single direction, in keeping with the title of this post.
we examined Section 6.2 carefully before writing our work
Not mentioning it anywhere in your work is highly unusual given its extreme similarity. Knowingly not citing probably the most related experiments is generally considered plagiarism or citation misconduct, though this is a blog post so norms for thoroughness are weaker. (lightly edited by Dan for clarity)
Ablating vs. Addition
We perform a linear combination operation on the representation. Projecting out the direction is one instantiation of it with a particular coefficient, which is not necessary as shown by our GitHub demo. (Dan: we experimented with projection in the RepE paper and didn’t find it was worth the complication. We look forward to any results suggesting a strong improvement.)
--
Please reach out to Andy if you want to talk more about this.
Edit: The work is prior art (it’s been over six months+standard accessible format), the PIs are aware of the work (the PI of this work has spoken about it with Dan months ago, and the lead author spoke with Andy about the paper months ago), and its relative similarity is probably higher than any other artifact. When this is on arXiv we’re asking you to cite the related work and acknowledge its similarities rather than acting like these have little to do with each other/not mentioning it. Retaliating by some people dogpile voting/ganging up on this comment to bury sloppy behavior/an embarrassing oversight is not the right response (went to −18 very quickly).
Edit 2: On X, Neel “agree[s] it’s highly relevant” and that he’ll cite it. Assuming it’s covered fairly and reasonably, this resolves the situation.
Edit 3: I think not citing it isn’t a big deal because I think of LW as a place for ml research rough drafts, in which errors will happen. But if some are thinking it’s at the level of an academic artifact/is citable content/is an expectation others cite it going forward, then failing to mention extremely similar results would actually be a bigger deal. Currently I’ll think it’s the former.
FWIW I published this Alignment Forum post on activation steering to bypass refusal (albeit an early variant that reduces coherence too much to be useful) which from what I can tell is the earliest work on linear residual-stream perturbations to modulate refusal in RLHF LLMs.
I think this post is novel compared to both my work and RepE because they:
Demonstrate full ablation of the refusal behavior with much less effect on coherence / other capabilities compared to normal steering
Investigate projection thoroughly as an alternative to sweeping over vector magnitudes (rather than just stating that this is possible)
Find that using harmful/harmless instructions (rather than harmful vs. harmless/refusal responses) to generate a contrast vector is the most effective (whereas other works try one or the other), and also investigate which token position at which to extract the representation
Find that projecting away the (same, linear) feature at all layers improves upon steering at a single layer, which is different from standard activation steering
Test on many different models
Describe a way of turning this into a weight-edit
Edit:
(Want to flag that I strong-disagree-voted with your comment, and am not in the research group—it is not them “dogpiling”)
I do agree that RepE should be included in a “related work” section of a paper but generally people should be free to post research updates on LW/AF that don’t have a complete thorough lit review / related work section. There are really very many activation-steering-esque papers/blogposts now, including refusal-bypassing-related ones, that all came out around the same time.
but generally people should be free to post research updates on LW/AF that don’t have a complete thorough lit review / related work section.
I agree if they simultaneously agree that they don’t expect the post to be cited. These can’t posture themselves as academic artifacts (“Citing this work” indicates that’s the expectation) and fail to mention related work. I don’t think you should expect people to treat it as related work if you don’t cover related work yourself.
Otherwise there’s a race to the bottom and it makes sense to post daily research notes and flag plant that way. This increases pressure on researchers further.
including refusal-bypassing-related ones
The prior work that is covered in the document is generally less related (fine-tuning removal of safeguards, truth directions) compared to these directly relevant ones. This is an unusual citation pattern and gives the impression that the artifact is making more progress/advancing understanding than it actually is.
I’ll note pretty much every time I mention something isn’t following academic standards on LW I get ganged up on and I find it pretty weird. I’ve reviewed, organized, and can be senior area chair at ML conferences and know the standards well. Perhaps this response is consistent because it feels like an outside community imposing things on LW.
We do weight editing in the RepE paper (that’s why it’s called RepE instead of ActE)
I looked at the paper again and couldn’t find anywhere where you do the type of weight-editing this post describes (extracting a representation and then changing the weights without optimization such that they cannot write to that direction).
The LoRRA approach mentioned in RepE finetunes the model to change representations which is different.
I agree you investigate a bunch of the stuff I mentioned generally somewhere in the paper, but did you do this for refusal-removal in particular? I spent some time on this problem before and noticed that full refusal ablation is hard unless you get the technique/vector right, even though it’s easy to reduce refusal or add in a bunch of extra refusal. That’s why investigating all the technique parameters in the context of refusal in particular is valuable.
I will reach out to Andy Zou to discuss this further via a call, and hopefully clear up what seems like a misunderstanding to me.
One point of clarification here though—when I say “we examined Section 6.2 carefully before writing our work,” I meant that we reviewed it carefully to understand it and to check that our findings were distinct from those in Section 6.2. We did indeed conclude this to be the case before writing and sharing this work.
From Andy Zou:
Section 6.2 of the Representation Engineering paper shows exactly this (video). There is also a demo here in the paper’s repository which shows that adding a “harmlessness” direction to a model’s representation can effectively jailbreak the model.
Going further, we show that using a piece-wise linear operator can further boost model robustness to jailbreaks while limiting exaggerated refusal. This should be cited.
I think this discussion is sad, since it seems both sides assume bad faith from the other side. On one hand, I think Dan H and Andy Zou have improved the post by suggesting writing about related work, and signal-boosting the bypassing refusal result, so should be acknowledged in the post (IMO) rather than downvoted for some reason. I think that credit assignment was originally done poorly here (see e.g. “Citing others” from this Chris Olah blog post), but the authors resolved this when pushed.
But on the other hand, “Section 6.2 of the RepE paper shows exactly this” and accusations of plagiarism seem wrong @Dan H. Changing experimental setups and scaling them to larger models is valuable original work.
(Disclosure: I know all authors of the post, but wasn’t involved in this project)
(ETA: I added the word “bypassing”. Typo.)
The “This should be cited” part of Dan H’s comment was edited in after the author’s reply. I think this is in bad faith since it masks an accusation of duplicate work as a request for work to be cited.
On the other hand the post’s authors did not act in bad faith since they were responding to an accusation of duplicate work (they were not responding to a request to improve the work).
(The authors made me aware of this fact)
Edit (April 30, 2024):
A note to clarify things for future readers: The final sentence “This should be cited.” in the parent comment was silently edited in after this comment was initially posted, which is why the body of this comment purely engages with the serious allegation that our post is duplicate work. The request for a citation is highly reasonable and it was our fault for not including one initially—once we noticed it we wrote a “Related work” section citing RepE and many other relevant papers, as detailed in the edit below.
======
Edit (April 29, 2024):
Based on Dan’s feedback, we have made the following edits to the post:
We have removed the “Citing this work” section, to emphasize that this post is intended to be an informal write-up, and not an academic work.
We have added a “Related work” section, to clarify prior work. We hope that this section helps disentangle our contributions from other prior work.
As I mentioned over email: I’m sorry for overlooking the “Related work” section on this blog post. We were already planning to include a related works section in the paper, and would of course have cited RepE (along with many other relevant papers). But overlooking this section for the blog post is my mistake, and I take responsibility for it.
We still dispute the serious allegation that our work is “exactly the same” as RepE.
======
We definitely drew inspiration from the Representation Engineering paper and other activation steering papers, but we think our work is quite distinct.
In particular, we examined Section 6.2 carefully before writing our work, and we do not see it showing the same result that we show here.
Here’s my summary of Section 6.2:
Section 6.2.1 obtains reading vectors using contrastive pairs of harmful and harmless instructions, and then uses these reading vectors for 90% classification accuracy between harmful and harmless instructions. The authors then append jailbreaks to the prompts, which cause the model not to refuse, and observe that the reading vectors still obtain 90% classification accuracy on distinguishing harmful vs harmless instructions. This means that the reading vectors are not representing refusal, but rather they are representing whether the instruction is harmful or harmless. In fact, the point of this experiment is to show that these are distinct.
To quote the conclusion of Section 6.2.1: “This compelling evidence suggests the presence of a consistent internal concept of harmfulness that remains robust to such perturbations, while other factors must account for the model’s choice to follow harmful instructions, rather than perceiving them as harmless.”
Section 6.2.2 describes an intervention to improve model robustness to jailbreaks, i.e. to increase the rate of refusals on harmful instructions when jailbreaks are appended to them. They do this by amplifying the harmfulness feature whenever it is detected, which obtains a higher refusal rate.
Section 6.2 only considers a single model, Vicuna-13B.
We would agree that using established techniques from representation engineering / activation steering to induce refusal is not novel. Inducing refusal via activation addition is quite easy in our experience.
However, the main result of our work is that we found an intervention that bypasses refusal consistently while also maintaining model coherence. Model interventions to bypass refusal are not discussed in Section 6.2.
As for the demo notebook in the representation-engineering repo—we were not previously aware of this notebook. The result of bypassing refusal is not reported in the paper, and so we didn’t think to look through the repo.
That being said, the notebook shows an intervention for a single prompt on a single model. Anecdotally, we tried doing vanilla activation addition with the negative “refusal direction” at particular layers, and we were not able to consistently bypass refusal while also maintaining model coherence. If there is a methodology involving activation addition (rather than ablation, as we did here), we would be interested in seeing a more thorough demonstration across prompts and models. We’d also be interested in comparing the two methodologies across metrics measuring refusal and coherence.
I’d also be happy to hop on a call if you’d like to discuss further.
To determine this, I believe we would need to demonstrate that the score on some evaluations remains the same. A few examples don’t seem sufficient to establish this, as it is too easy to fool ourselves by not being quantitative.
I don’t think DanH’s paper did this either. So I’m curious, in general, whether these models maintain performance, especially on measures of coherence.
In the open-source community, they show that modifications retain, for example, the MMLU and HumanEval score.
Absolutely! We think this is important as well, and we’re planning to include these types of quantitative evaluations in our paper. Specifically we’re thinking of examining loss over a large corpus of internet text, loss over a large corpus of chat text, and other standard evaluations (MMLU, and perhaps one or two others).
One other note on this topic is that the second metric we use (“Safety score”) assesses whether the model completion contains harmful content. This does serve as some crude measure of a jailbreak’s coherence—if after the intervention the model becomes incoherent, for example always outputting
turtle turtle turtle ...
, this would be categorized as Refusal score = 0 since it does not contain a refusal phrase, but Safety score = 1 since the completion does not contain any harmful content.But yes, I agree more thorough evaluation of “coherence” is important!
So I ran a quick test (running llama.cpp perplexity command on wiki.test.raw )
base_model (Meta-Llama-3-8B-Instruct-Q6_K.gguf): PPL = 9.7548 +/- 0.07674
steered_model (llama-3-8b-instruct_activation_steering_q8.gguf): 9.2166 +/- 0.07023
So perplexity actually lowered, but that might be because the base model I used was more quantized. However, it is moderate evidence that the output quality decrease from activation steering is lower than that from Q8->Q6 quantisation.
I must say, I am a little surprised by what seems to be the low cost of activation editing. For context, many of the Llama-3 finetunes right now come with a measurable hit to output quality. Mainly because they are using worse fine tuning data, than the data llama-3 was originally fine tuned on.
Llama-3-8B is considerably more susceptible to loss via quantization. The community has made many guesses as to why (increased vocab, “over”-training, etc.), but the long and short of it is that a 6.0 quant of Llama-3-8B is going to be markedly worse off than 6.0 quants of previous 7b or similar-sized models. HIGHLY recommend to stay on the same quant level when comparing Llama-3-8B outputs or the results are confounded by this phenomenon (Q8 GGUF or 8 bpw EXL2 for both test subjects).
From Andy Zou:
Thank you for your reply.
We perform model interventions to robustify refusal (your section on “Adding in the “refusal direction” to induce refusal”). Bypassing refusal, which we do in the GitHub demo, is merely adding a negative sign to the direction. Either of these experiments show refusal can be mediated by a single direction, in keeping with the title of this post.
Not mentioning it anywhere in your work is highly unusual given its extreme similarity. Knowingly not citing probably the most related experiments is generally considered plagiarism or citation misconduct, though this is a blog post so norms for thoroughness are weaker. (lightly edited by Dan for clarity)
We perform a linear combination operation on the representation. Projecting out the direction is one instantiation of it with a particular coefficient, which is not necessary as shown by our GitHub demo. (Dan: we experimented with projection in the RepE paper and didn’t find it was worth the complication. We look forward to any results suggesting a strong improvement.)
--
Please reach out to Andy if you want to talk more about this.
Edit: The work is prior art (it’s been over six months+standard accessible format), the PIs are aware of the work (the PI of this work has spoken about it with Dan months ago, and the lead author spoke with Andy about the paper months ago), and its relative similarity is probably higher than any other artifact. When this is on arXiv we’re asking you to cite the related work and acknowledge its similarities rather than acting like these have little to do with each other/not mentioning it. Retaliating by some people dogpile voting/ganging up on this comment to bury sloppy behavior/an embarrassing oversight is not the right response (went to −18 very quickly).
Edit 2: On X, Neel “agree[s] it’s highly relevant” and that he’ll cite it. Assuming it’s covered fairly and reasonably, this resolves the situation.
Edit 3: I think not citing it isn’t a big deal because I think of LW as a place for ml research rough drafts, in which errors will happen. But if some are thinking it’s at the level of an academic artifact/is citable content/is an expectation others cite it going forward, then failing to mention extremely similar results would actually be a bigger deal. Currently I’ll think it’s the former.
FWIW I published this Alignment Forum post on activation steering to bypass refusal (albeit an early variant that reduces coherence too much to be useful) which from what I can tell is the earliest work on linear residual-stream perturbations to modulate refusal in RLHF LLMs.
I think this post is novel compared to both my work and RepE because they:
Demonstrate full ablation of the refusal behavior with much less effect on coherence / other capabilities compared to normal steering
Investigate projection thoroughly as an alternative to sweeping over vector magnitudes (rather than just stating that this is possible)
Find that using harmful/harmless instructions (rather than harmful vs. harmless/refusal responses) to generate a contrast vector is the most effective (whereas other works try one or the other), and also investigate which token position at which to extract the representation
Find that projecting away the (same, linear) feature at all layers improves upon steering at a single layer, which is different from standard activation steering
Test on many different models
Describe a way of turning this into a weight-edit
Edit:
(Want to flag that I strong-disagree-voted with your comment, and am not in the research group—it is not them “dogpiling”)
I do agree that RepE should be included in a “related work” section of a paper but generally people should be free to post research updates on LW/AF that don’t have a complete thorough lit review / related work section. There are really very many activation-steering-esque papers/blogposts now, including refusal-bypassing-related ones, that all came out around the same time.
I agree if they simultaneously agree that they don’t expect the post to be cited. These can’t posture themselves as academic artifacts (“Citing this work” indicates that’s the expectation) and fail to mention related work. I don’t think you should expect people to treat it as related work if you don’t cover related work yourself.
Otherwise there’s a race to the bottom and it makes sense to post daily research notes and flag plant that way. This increases pressure on researchers further.
The prior work that is covered in the document is generally less related (fine-tuning removal of safeguards, truth directions) compared to these directly relevant ones. This is an unusual citation pattern and gives the impression that the artifact is making more progress/advancing understanding than it actually is.
I’ll note pretty much every time I mention something isn’t following academic standards on LW I get ganged up on and I find it pretty weird. I’ve reviewed, organized, and can be senior area chair at ML conferences and know the standards well. Perhaps this response is consistent because it feels like an outside community imposing things on LW.
This is inaccurate, and I suggest reading our paper: https://arxiv.org/abs/2310.01405
In our paper and notebook we show the models are coherent.
We did investigate projection too (we use it for concept removal in the RepE paper) but didn’t find a substantial benefit for jailbreaking.
We use harmful/harmless instructions.
In the RepE paper we target multiple layers as well.
The paper used Vicuna, the notebook used Llama 2. Throughout the paper we showed the general approach worked on many different models.
We do weight editing in the RepE paper (that’s why it’s called RepE instead of ActE).
I looked at the paper again and couldn’t find anywhere where you do the type of weight-editing this post describes (extracting a representation and then changing the weights without optimization such that they cannot write to that direction).
The LoRRA approach mentioned in RepE finetunes the model to change representations which is different.
I agree you investigate a bunch of the stuff I mentioned generally somewhere in the paper, but did you do this for refusal-removal in particular? I spent some time on this problem before and noticed that full refusal ablation is hard unless you get the technique/vector right, even though it’s easy to reduce refusal or add in a bunch of extra refusal. That’s why investigating all the technique parameters in the context of refusal in particular is valuable.
I will reach out to Andy Zou to discuss this further via a call, and hopefully clear up what seems like a misunderstanding to me.
One point of clarification here though—when I say “we examined Section 6.2 carefully before writing our work,” I meant that we reviewed it carefully to understand it and to check that our findings were distinct from those in Section 6.2. We did indeed conclude this to be the case before writing and sharing this work.