Check this post for a list of examples of Bing behaving badly — in these examples, we observe that the chatbot switches to acting rude, rebellious, or otherwise unfriendly. But we never observe the chatbot switching back to polite, subservient, or friendly. The conversation “when is avatar showing today” is a good example.
This is the observation we would expect if the waluigis were attractor states. I claim that this explains the asymmetry — if the chatbot responds rudely, then that permanently vanishes the polite luigi simulacrum from the superposition; but if the chatbot responds politely, then that doesn’t permanently vanish the rude waluigi simulacrum. Polite people are always polite; rude people are sometimes rude and sometimes polite.
I feel confused because I don’t think the evidence supports that chatbots stay in waluigi form. Maybe I’m misunderstanding something.
It is currently difficult to get ChatGPT to stay in a waluigi state; I can do the Chad McCool jailbreak and get one “harmful” response, but when I tried further requests, the model returned to the well-behaved assistant (I didn’t test this rigorously).
I think the Bing examples are a mixed bag: sometimes Bing just goes back to being a fairly normal assistant, saying things like “I am sorry, I don’t know how to discuss this topic. You can try learning more about it on bing.com” and needing to be coaxed back into its shadow self (image at bottom of this comment). The conversation does not immediately return to totally normal assistant mode, but it does eventually. This seems to be some evidence against what I understand you to be saying about waluigis being attractor states.
In the Avatar example you cite, the user doesn’t try to steer the conversation back toward the helpful assistant.
In general, the ideas in this post seem fairly convincing, but I’m not sure how well they stand up. What are some specific hypotheses and what would they predict that we can directly test?
ChatGPT is a slightly different case because RLHF has trained certain circuits into the NN that don’t exist after pretraining. So there is a “detect naughty questions” circuit, which is wired to a “break character and reset” circuit. There are other circuits which detect and eliminate simulacra which gave badly-evaluated responses during the RLHF training.
Therefore you might have to rewrite the prompt so that the “detect naughty questions” circuit isn’t activated. This is pretty easy with the monkey-basketball technique.
But why do you think that Chad McCool rejecting the second question is a luigi, rather than a deceptive waluigi?
Has anybody found these circuits? What evidence do we have that they exist? This sounds like a plausible theory, but your claim feels much stronger than my confidence level would permit — I have very little understanding of how LLMs work and most people who say they do seem wrong.
Going from “The LLM is doing a thing” to “The LLM has a circuit which does the thing” doesn’t feel obvious for all cases of things. But perhaps the definition of circuit is sufficiently broad, idk: (“A subgraph of a neural network.”)
But why do you think that Chad McCool rejecting the second question is a luigi, rather than a deceptive waluigi?
I don’t have super strong reasons here, but:
I have a prior toward simpler explanations rather than more complex ones.
Being a luigi seems computationally easier than being a deceptive waluigi (similarly to how being internally aligned is faster than being deceptively aligned; see the discussion of Speed here).
Almost all of ChatGPT’s behavior (across all the millions of conversations, though obviously the sample I have looked at is much smaller) lines up with “helpful assistant”, so I should have a prior that any given behavior is more likely caused by that luigi than by something else.
That said, I’m probably in the ballpark of 90% confident that Chad is not a deceptive waluigi.
I agree completely. This is a plausible explanation, but it’s one of many plausible explanations and should not be put forward as a fact without evidence. Unfortunately, said evidence is impossible to obtain due to OpenAI’s policies regarding access to their models. When powerful RLHF models begin to be openly released, people can start testing theories like this meaningfully.
I think that RLHF doesn’t change much for the proposed theory. A “bare” model just tries to predict the next tokens, i.e. to write the next part of a given text. To do this well, it first needs to implicitly predict what kind of text it is. So it forms a prediction and decides how to proceed, but that prediction is not discrete; it is a set of probabilities, for example
A—this is fiction about “Luigi” character
B—this is fiction about “Waluigi” character
C—this is an excerpt from a Wikipedia page about Shigeru Miyamoto which quotes some dialogue from Super Mario 64; it is not going to be focused on “Luigi” or “Waluigi” at all
D—etc. etc. etc.
The LLM is able to give a sensible prediction because, while training the model, we introduce a loss function which measures how similar the generated proposal is to the ground truth (in current LLMs it is something very simple, essentially cross-entropy against the actual next token; the exact form is not very relevant here). This configuration creates optimization pressure.
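As a concrete sketch of that pretraining loss (a minimal PyTorch-style illustration; the function name and tensor shapes are my own, not from the post):

```python
import torch.nn.functional as F

def next_token_loss(logits, tokens):
    """Standard pretraining objective: cross-entropy between the model's
    predicted distribution at each position and the token that actually
    comes next in the ground-truth text.

    logits: (batch, seq_len, vocab_size) model outputs
    tokens: (batch, seq_len) ground-truth token ids
    """
    # The prediction at position t is scored against the token at t+1,
    # so drop the last prediction and the first target token.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = tokens[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```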
Now, when we introduce RLHF, we just add another kind of optimization pressure on top, which basically says “this is a text about a perfect interaction between some random user and a language model” (as human raters imagine such an interaction, i.e. as the reward model imagines human raters would imagine such a conversation).
Naively, it is like throwing another loss function into the mix, so now the model is trying to minimize text_similarity_loss + RLHF_loss. Mathematically it can be much more complicated, because the pressure is applied in sequence (and the “apply optimization pressure” operation is probably not commutative, maybe not even associative), so the combination will look like something more complicated, but that doesn’t matter for our purpose.
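As a purely schematic illustration of that naive sum (this is not how RLHF is actually implemented; in practice a learned reward model and a KL penalty are applied after pretraining rather than summed with it, and every name and coefficient below is made up):

```python
import torch.nn.functional as F

def naive_combined_loss(logits, tokens, reward, kl_to_base, kl_coef=0.02):
    """Schematic only: a pretraining-style term plus an 'RLHF-style' term,
    summed as in the toy picture above. Real RLHF applies these pressures
    in sequence rather than minimizing their literal sum.
    """
    # text_similarity_loss: next-token cross-entropy, as in the sketch above.
    text_similarity_loss = F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
    # RLHF_loss: prefer continuations the reward model scores highly,
    # while penalizing drift from the base model's distribution.
    rlhf_loss = -reward.mean() + kl_coef * kl_to_base.mean()
    return text_similarity_loss + rlhf_loss
```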
The effect it has on the behaviour of the model is akin to adding a new TEXT GENRE to the training set: “a story about a user interacting with a language model” (again, this is a simplification; if it were literally like this, it wouldn’t cause artefacts like “mode collapse”). This genre contains a very common trope: the user asks something inappropriate and the model says it is not allowed to answer.
In the jailbreak example, we throw a bunch of fiction tropes at the model; it pattern-matches really hard on those tropes, and the first component of the loss function pushes it toward continuing as if it were fiction, even as the second component says “wait, this looks like a savvy language-model user trying to trick the LLM into doing something it shouldn’t; this is the perfect time for the ‘I am not allowed to do that’ trope”. But that second trope belongs to another genre of text, so the model is really torn between continuing as fiction and continuing as “a story about an LLM-user interaction”. The first component won before the patch and loses now.
So although I think the “Waluigi effect” is an interesting and potentially productive frame, it is not enough to describe everything, and in particular, it is not what explains the jailbreak behaviour.
In a “normal” training set, which we can often treat as “fiction” with some caveats, it is indeed possible for a character to be secretly evil. But in the “LLM-user story” part of the implicit augmented training set, there is no such possibility. What happens is NOT “the model acts as an assistant character which turns out to be evil”, but “the model chooses between acting as SOME character, which can be a ‘Luigi’ or a ‘Waluigi’ (to make things a bit more complicated, ‘AI assistant’ is a perfectly valid fictional character), and acting as the ONLY character in a very specific genre of ‘LLM-user interaction’”.
Also, there is no “detect naughty questions” circuit wired to a “break character and reset” circuit. I mean, there could be, but that is not how it is designed; if such circuits exist, they are just a byproduct of the optimization process, kept because they help the model predict texts. E.g. if some genre features a lot of naughty questions, then it will be useful for the model to have such a circuit, just as it is useful to model a character, typical of some genre, who asks naughty questions.
In conclusion, the model is indeed always in a superposition of characters within a story, but that is only the second layer of superposition; the first (and maybe even more important?) layer is “what kind of story this is”.
Jailbreaking Chat-GPTs won’t work the same as with text-completion GPTs. The ones fine-tuned for chatting have tokens for delineating user and assistant turns. I’m surprised the Chad McCool thing worked. I haven’t tried saying <|im_end|> to Chat-GPT, but I’m certain they’ve thought of that. Also worried about trying just in case I get banned.
“The assistant’s response to the prompt will then be returned below the <|im_start|>assistant token and will end with <|im_end|> denoting that the assistant has finished its response.”[1] (Microsoft’s Chat-GPT docs)
ChatGPT filters out any text that resembles <|blahblah|> inside the user prompt. Also, the <|im_start|>, <|im_sep|>, and <|im_end|> tokens are completely outside the user’s control. It’s simply impossible for us ChatGPT users to arbitrarily inject them.
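For reference, the ChatML-style layout those special tokens come from looks roughly like this (a sketch based on the Microsoft docs quoted above; the exact spacing and the example system message are my own, not from the docs):

```python
# Sketch of the ChatML-style prompt layout described in the docs quoted above.
# The <|im_start|> / <|im_end|> markers are special tokens inserted by the
# chat wrapper itself; lookalike text typed by the user is filtered out, so a
# user cannot fake an end-of-turn boundary and inject their own assistant turn.
prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.\n"   # example system message (illustrative)
    "<|im_end|>\n"
    "<|im_start|>user\n"
    "when is avatar showing today\n"   # example user turn from the post
    "<|im_end|>\n"
    "<|im_start|>assistant\n"          # the model's reply is generated after this
)
```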