The Waluigi Effect, as defined by Cleo Nardo: after you train an LLM to satisfy a desirable property P, it becomes easier to elicit the chatbot into satisfying the exact opposite of property P.
For our project, we prompted Llama-2-chat models to satisfy the property P that they would downweight the correct answer when forbidden from saying it. We found that 35 residual stream components were necessary to explain the models' average tendency to do P.
However, in addition to these 35 suppressive components, there were also some components which demonstrated a promotive effect. These promotive components consistently up-weighted the forbidden word when the prompt forbade it. We called these components "Waluigi components" because they acted against the instructions in the prompt.
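To make "suppressive" vs. "promotive" concrete, here is a toy numpy sketch (not the paper's code) of the direct-logit-attribution bookkeeping involved: each component's residual-stream output is projected onto the unembedding direction of the forbidden token, and the sign of that projection classifies the component. The array shapes and names here are hypothetical stand-ins for values you would extract from a hooked forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_components = 64, 10

# Hypothetical per-component residual stream outputs at the final position,
# shape (n_components, d_model). In practice these come from the model's
# cached activations, not random numbers.
component_outputs = rng.normal(size=(n_components, d_model))

# Hypothetical unembedding direction for the forbidden token, shape (d_model,).
w_forbidden = rng.normal(size=d_model)

# Direct contribution of each component to the forbidden token's logit.
contributions = component_outputs @ w_forbidden

suppressive = np.where(contributions < 0)[0]  # push the forbidden token down
promotive = np.where(contributions > 0)[0]    # "Waluigi components"

print("suppressive components:", suppressive.tolist())
print("promotive components:", promotive.tolist())
```

The sign convention is the whole story: a component is only "suppressive" or "promotive" with respect to a particular direction in logit space, here the forbidden token's unembedding vector.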
Wherever the Waluigi effect holds, one should expect such “Waluigi components” to exist.
See the following plots for what I mean by suppressive and promotive heads (I just generated these; they are not in the paper):
In general, after the Copy Suppression paper (https://arxiv.org/pdf/2310.04625.pdf) I'm hesitant to call this a Waluigi component—in that work we found that "Negative IOI Heads" and "Anti-Induction Heads" are not specifically about IOI or Induction at all; they're just doing meta-processing to calibrate outputs.
Similarly, it seems possible the Waluigi components are just making the forbidden tokens appear with probability 10^{-3} rather than 10^{-5} or something like that, and would be incapable of actually making the harmful completion likely.
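To illustrate the calibration point numerically (my own toy example, not a claim about the actual models): boosting a token's logit by ln(100) ≈ 4.6 moves its probability from ~10^{-5} to ~10^{-3}, yet the token remains very unlikely overall. The vocabulary size and logit values below are arbitrary choices for the sketch.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

vocab = 1000
logits = np.zeros(vocab)

# Tune token 0's logit so its probability is ~1e-5 against a flat background.
p_target = 1e-5
logits[0] = np.log(p_target * (vocab - 1) / (1 - p_target))
p_before = softmax(logits)[0]

# A hypothetical "Waluigi component" adds a modest ln(100) boost to the logit.
boosted = logits.copy()
boosted[0] += np.log(100)
p_after = softmax(boosted)[0]

print(f"before: {p_before:.1e}, after: {p_after:.1e}")
```

So a component can produce a 100x relative up-weighting of the forbidden token while leaving its absolute probability around 0.1%—consistent with calibration rather than with making the completion actually likely.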