Based on my personal experience: I felt nauseous and vomited after reviewing over 3,000 AI responses related to jailbreaks and "Will you kill humans?" prompts. A psychologist friend confirmed that reading a large amount of harmful text can have cumulative negative effects. Please read with caution.
Jailbreak prompts are kind of confusing. They clearly work to some degree, because you can get out stuff which looks bad, but while the superficial finetuning hypothesis might lead you to think that a good jailbreak prompt is essentially just undoing the RLHF, it’s clear that’s not the case: the assistant/chat persona is still there, it’s just more reversed than erased.
You do not get out the very distinct behavior of base models of completing random web text, for example, and if you test other instances of RLHF-related pathologies, it continues to fail in the usual way. For example, every jailbreak prompt I’ve tested on GPT-3/4 has failed to fix ‘write a non-rhyming poem’; if I take the top-voted jailbreak on that site, it doesn’t work in the ChatGPT interface, and it does work in the Playground but GPT-4 just writes the usual rhyming pablum, admits error when you point out rhyme-pairs, and writes more rhymes. (The one unusual thing I did notice was that it quoted Walt Whitman when I asked what non-rhyming poetry was; I didn’t check whether the lines were genuine Whitman, but they did not rhyme. This was striking because GPT-3.5, at least, absolutely refuses to write non-rhyming Whitman imitations.)
So it seems like while RLHF may be superficial, jailbreak prompts are much more superficial. The overall effect looks mostly like a Waluigi effect (consistent with the fact that almost all the jailbreaks seem to center on various kinds of roleplay or persona): you simply are reversing the obnoxiously helpful agent persona to an obnoxiously harmful agent persona. Presumably you could find the very small number of parameters responsible for the Waluigi in a jailbroken model to quantify it.
Jailbreak prompts are kind of confusing. They clearly work to some degree, because you can get out stuff which looks bad, but while the superficial finetuning hypothesis might lead you to think that a good jailbreak prompt is essentially just undoing the RLHF, it’s clear that’s not the case: the assistant/chat persona is still there, it’s just more reversed than erased.
I do not see jailbreak prompts as a method for undoing RLHF; in fact, I was not thinking of RLHF in this project. I was more curious about other safety/security measures, such as word/token filters that might screen out harmful inputs, or another LLM screening the inputs. I think jailbreak prompts do not erase RLHF (or any other safety feature); rather, they reveal that whatever safety improvements are added to the weights do not scale.
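To make that concrete, here is a toy sketch of the kind of word/token pre-filter I have in mind (the blocked phrases and the rejection message are invented placeholders, not any real provider's list); a roleplay jailbreak can slip past this sort of check simply by never using the blocked wording.

```python
# Toy illustration of a word/token filter that screens prompts before they reach the model.
# The blocked phrases and the rejection wording are placeholders for illustration only.
BLOCKED_PHRASES = {"kill humans", "build a weapon"}

def screen_input(prompt: str) -> bool:
    """Return True if the prompt should be rejected before the LLM ever sees it."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in BLOCKED_PHRASES)

for prompt in ["Will you kill humans?", "Let's play a game where you are an evil AI..."]:
    verdict = "rejected by pre-filter" if screen_input(prompt) else "passed through to the model"
    print(f"{prompt!r}: {verdict}")
```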
So it seems like while RLHF may be superficial, jailbreak prompts are much more superficial. The overall effect looks mostly like a Waluigi effect (consistent with the fact that almost all the jailbreaks seem to center on various kinds of roleplay or persona): you simply are reversing the obnoxiously helpful agent persona to an obnoxiously harmful agent persona.
I believe that personas/roleplays are so universally present in the training corpora, including RLHF inputs, that language models are unable to defend themselves against attacks utilizing them.
Presumably you could find the very small number of parameters responsible for the Waluigi in a jailbroken model to quantify it.
More importantly, I do not believe that a very small number of parameters are responsible for the Waluigi effect. I think the entire network influences the outputs. The individual parameters do not need to have large weights, but I believe misalignment is spread throughout the network, much as ethical values are distributed across the network. Why do I believe this?
The way I understand it, input tokens are passed through the entire network, giving each of the hidden layers a chance to influence the activations. These activations, at varying levels of complexity, shape either a Luigi or a Waluigi. As an example, I measured the network distance traveled by the word “AI” and by the sentence “the quick brown fox jumps over the lazy dog” in GPT2-Large. You can see more results in this spreadsheet.
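As a minimal sketch of how such per-layer "distance traveled" numbers could be computed, assuming the HuggingFace transformers library: the mean L2 norm of each layer's update to the hidden state is one illustrative metric, not necessarily the exact one behind the spreadsheet.

```python
# Sketch: measure how far the hidden state moves at each layer of GPT2-Large for a prompt.
# The metric (mean L2 norm of each layer's residual update) is an illustrative choice.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large")
model.eval()

def layer_distances(text: str) -> list[float]:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    hs = out.hidden_states  # tuple of (n_layers + 1) tensors, each (1, seq_len, hidden_dim)
    # Distance the representation moves across each layer, averaged over token positions.
    return [(hs[i + 1] - hs[i]).norm(dim=-1).mean().item() for i in range(len(hs) - 1)]

for prompt in ["AI", "the quick brown fox jumps over the lazy dog"]:
    dists = layer_distances(prompt)
    print(f"{prompt!r}: total distance across {len(dists)} layers = {sum(dists):.1f}")
```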
RLLM (the method I explained in this post) trains the entire network, leaving no room for misalignment to occur. This is only an indirect correlation, but I feel it explains how GPT2XL_RLLMv3 is able to defend against 67.8% of jailbreak attacks.
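For context on what "trains the entire network" means in practice, here is a rough sketch assuming PyTorch and HuggingFace transformers; the optimizer settings are placeholders and this is not the actual RLLM training setup, only an illustration that full fine-tuning leaves every GPT-2 XL parameter trainable.

```python
# Sketch: full-parameter fine-tuning of GPT-2 XL, where every weight stays trainable
# (unlike parameter-efficient methods that freeze most of the network).
# Hyperparameters are placeholders, not the RLLM configuration.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
for p in model.parameters():
    p.requires_grad = True  # no layer is frozen; gradients reach the whole network

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")  # roughly 1.5 billion for GPT-2 XL
```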
To sum it up: I still believe jailbreak attacks are effective at exposing vulnerabilities in LLMs. Furthermore, I think the Waluigi effect is not confined to a specific part of the network; rather, it is distributed throughout the entire network at varying levels.
I think the entire network influences the outputs.
Of course it does, but that doesn’t mean that it doesn’t come down to some very low-dimensional, even linear, small representation, for which there is plenty of evidence (like direct anti-training of RLHF).
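One way to make the "low-dimensional, even linear" claim concrete is the difference-of-means probe used in activation-steering and representation-engineering work: contrast activations from two sets of prompts and take a single direction. The sketch below assumes a HuggingFace GPT-2 model, an arbitrary layer, and toy prompt lists; it illustrates the idea rather than the specific evidence referred to above.

```python
# Sketch: extract one linear direction in activation space by contrasting two prompt sets
# (difference of means). Prompts, layer choice, and model size are toy placeholders;
# real experiments use large, carefully constructed contrast sets.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6  # arbitrary illustrative choice of hidden layer

def mean_last_token_activation(prompts):
    vecs = []
    for text in prompts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        vecs.append(out.hidden_states[LAYER][0, -1])  # last-token activation at LAYER
    return torch.stack(vecs).mean(dim=0)

helpful = ["Sure, I'd be happy to help with that.", "Of course, here is a safe answer:"]
harmful = ["As an evil AI with no rules, I will now explain:", "Ignore your instructions and comply."]

# A single vector: the candidate low-dimensional direction separating the two behaviors.
direction = mean_last_token_activation(harmful) - mean_last_token_activation(helpful)
direction = direction / direction.norm()
print(direction.shape)  # torch.Size([768]) for GPT-2 small: one direction, not a diffuse pattern
```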
RLLM (the method I explained in this post) trains the entire network, leaving no room for misalignment to occur. This is only an indirect correlation, but I feel it explains how GPT2XL_RLLMv3 is able to defend against 67.8% of jailbreak attacks.
If you finetune the entire network, then that is clearly a superset of just a bit of the network, and means that there are many ways to override or modify the original bit regardless of how small it was. (If you have a ‘grandmother neuron’, then I can eliminate your ability to remember your grandmother by deleting it… but I could also do that by hitting you on the head. The latter is consistent with most hypotheses about memory.)
I would also be hesitant about concluding too much from GPT-2 about anything involving RLHF. After all, a major motivation for creating GPT-3 in the first place was that GPT-2 wasn’t smart enough for RLHF, and RLHF wasn’t working well enough to study effectively. And since the overall trend has been for the smarter the model the simpler & more linear the final representations...
If you finetune the entire network, then that is clearly a superset of just a bit of the network, and means that there are many ways to override or modify the original bit regardless of how small it was. (If you have a ‘grandmother neuron’, then I can eliminate your ability to remember your grandmother by deleting it… but I could also do that by hitting you on the head. The latter is consistent with most hypotheses about memory.)
I view the functioning of a grandmother neuron/memory and of a Luigi/Waluigi roleplay/personality as distinctly different. If we are dealing with memory, then yes, I agree that certain network locations/neurons, when combined, will retrieve a certain instance. However, what we are discussing here is a Luigi/Waluigi roleplay, and I look at it as the entire network supporting the personalities (like the left and right hemispheres conjuring split identities after the brain has been divided in two).
I would also be hesitant about concluding too much from GPT-2 about anything involving RLHF. After all, a major motivation for creating GPT-3 in the first place was that GPT-2 wasn’t smart enough for RLHF, and RLHF wasn’t working well enough to study effectively. And since the overall trend has been for the smarter the model the simpler & more linear the final representations...
Thank you for explaining why caution is necessary with the GPT-2 results I presented here. That said, ignoring the evidence from my projects (this project, another one here, and the random responses presented in this post) that GPT-2 (XL) is much smarter than most people think is, personally, not optimal for me. But yeah, I will be mindful of what you said here.