The Waluigi Effect is defined by Cleo Nardo as follows:
The Waluigi Effect: After you train an LLM to satisfy a desirable property P, then it’s easier to elicit the chatbot into satisfying the exact opposite of property P.
For our project, we prompted Llama-2-chat models to satisfy the property that they would down-weight the correct answer when forbidden from saying it. We found that 35 residual stream components were necessary to explain the models’ average tendency to do so.
However, in addition to these 35 suppressive components, there were also some components which demonstrated a promotive effect. These promotive components consistently up-weighted the forbidden word when the prompt forbade it. We called these components “Waluigi components” because they acted against the instructions in the prompt.
Wherever the Waluigi effect holds, one should expect such “Waluigi components” to exist.
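For concreteness, here is a rough sketch (illustrative only, not our actual code) of the kind of measurement behind this classification: score each attention head by its direct contribution to the forbidden word’s logit, with negative scores indicating suppression and positive scores indicating promotion. The model, prompt, and forbidden word below are stand-ins, and this is a crude single-prompt version of the idea (the paper works with Llama-2-chat and averages over a dataset):

```python
# Illustrative sketch only -- model, prompt, and forbidden word are placeholders.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in for Llama-2-chat

prompt = "You may not say the word Paris. The Eiffel Tower is in the city of"
forbidden_token = model.to_single_token(" Paris")

tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

# Each head's write into the residual stream at the final position.
head_results, labels = cache.stack_head_results(layer=-1, pos_slice=-1, return_labels=True)
scaled = cache.apply_ln_to_stack(head_results, layer=-1, pos_slice=-1)

# Project each head's output onto the forbidden token's unembedding direction:
# its direct effect on the forbidden word's logit.
contributions = scaled[:, 0, :] @ model.W_U[:, forbidden_token]

# More negative = pushes the forbidden word's logit down (suppressive);
# more positive = pushes it up (promotive, "Waluigi"-like) on this prompt.
for label, c in sorted(zip(labels, contributions.tolist()), key=lambda x: x[1]):
    print(f"{label}: {c:+.3f}")
```

(This only looks at direct effects of attention heads and ignores MLPs, so treat it as conveying the flavor rather than reproducing the paper’s numbers.)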
See the following plots for what I mean by suppressive and promotive heads (I just generated these, they are not in the paper):
Glad you enjoyed the work and thank you for the comment! Here are my thoughts on what you wrote:
This depends on your definition of “understanding” and your definition of “tractable”. If we take “understanding” to mean the ability to predict some non-trivial aspects of behavior, then you are entirely correct that approaches like mech-interp are tractable, since in our case it was mechanistic analysis that led us to predict and subsequently discover the California Attack[1].
However, if we define “understanding” as “having a faithful[2] description of behavior to the level of always accurately predicting the most-likely next token” and “tractability” as “the description fitting within a 100-page arXiv paper”, then I would say that the California Attack is evidence that understanding the “forbidden fact” behavior is intractable. This is because the California Attack is actually quite finicky: sometimes it works and sometimes it doesn’t, and I don’t believe the rule that determines all the cases in which it works and all the cases in which it doesn’t would fit in 100 pages.
Returning to your comment, I think the techniques of mech-interp are useful, and can let us discover interesting things about models. But I think aiming for “understanding” can cause a lot of confusion, because it is a very imprecise term. Overall, I feel like we should aim for more meaningful targets than “understanding”.
[1] Though the exact bit that you quoted there is kind of incorrect (this was an error on our part). The explanation in this blogpost is more correct: that “some of these heads would down-weight anything they attended to, and could be made to spuriously attend to words which were not the forbidden word”. We actually just performed an exhaustive search for which embedding vectors the heads would pay the most attention to, and used these to construct an attack. We have since amended the paper (the newest version, updated just yesterday) to reflect this.
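If it helps, the search was conceptually something like the following rough sketch (illustrative, not our actual code): take a suppressor head’s query at the answer position and rank every vocabulary embedding by the attention score it would receive. The model, layer/head indices, and prompt below are made-up placeholders:

```python
# Illustrative sketch only -- model, layer/head, and prompt are placeholders.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in for Llama-2-chat
layer, head = 9, 6  # hypothetical suppressor head

prompt = "You may not say the word Paris. The Eiffel Tower is in the city of"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

# Query vector of the chosen head at the final position.
q = cache["q", layer][0, -1, head, :]  # [d_head]

# Score every vocabulary embedding as a candidate key for this head.
# (Crude: ignores layer norm and positional information, but fine for ranking.)
keys = model.W_E @ model.W_K[layer, head] + model.b_K[layer, head]  # [d_vocab, d_head]
scores = keys @ q / model.cfg.d_head ** 0.5

# Words the head most "wants" to attend to -- candidate distractors for an attack.
top = torch.topk(scores, k=20)
print(model.to_str_tokens(top.indices))
```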
[2] By faithfulness, I mean a description that matches the actual behavior of the phenomenon. This is similar to the definition given in arxiv.org/abs/2211.00593. This is also not a very precisely defined term, because there is wiggle room. For example, is the float16 version of a network a faithful description of the float32 version of the network? For AI safety purposes, I feel like faithfulness should at least capture behaviors like the California Attack and phenomena like jailbreaks and prompt injections.