Arthur Conmy comments on Takeaways from a Mechanistic Interpretability project on “Forbidden Facts”

Arthur Conmy 20 Dec 2023 13:22 UTC
3 points
Thanks!
In general, after the Copy Suppression paper (https://arxiv.org/pdf/2310.04625.pdf), I’m hesitant to call this a Waluigi component. In that work we found that “Negative IOI Heads” and “Anti-Induction Heads” are not specifically about IOI or Induction at all; they’re just doing meta-processing to calibrate outputs.
Similarly, it seems possible that the Waluigi components are just making the forbidden tokens appear with probability 10^{-3} rather than 10^{-5} or something like that, and would be incapable of actually making the harmful completion likely.
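
To put rough numbers on that, here is a minimal sketch of the arithmetic, assuming a component acts by adding a fixed amount to one token's logit while every other logit stays the same. The helper name `boosted_prob` and the specific probabilities are illustrative, not taken from the post or the paper.

```python
import math

def boosted_prob(p: float, delta_logit: float) -> float:
    """Probability of a token after adding delta_logit to its logit,
    with all other logits held fixed (assumption for this sketch).

    Under softmax, adding delta to one logit multiplies that token's
    odds p / (1 - p) by exp(delta).
    """
    odds = p / (1.0 - p) * math.exp(delta_logit)
    return odds / (1.0 + odds)

p_before = 1e-5                 # illustrative baseline probability
boost = math.log(100)           # about 4.6 nats of extra logit

p_after = boosted_prob(p_before, boost)

# Logit boost that would be needed to reach probability 0.5 from p_before:
# target odds are 1, so the boost is -log(current odds).
needed_for_half = -math.log(p_before / (1.0 - p_before))

print(f"after boost: {p_after:.2e}")
print(f"boost needed for p = 0.5: {needed_for_half:.1f} nats")
```

Under these assumptions, a 100x increase in probability still leaves the forbidden token at roughly 10^{-3}, and getting it to be the likely completion would take a much larger logit boost than the one that moves it from 10^{-5} to 10^{-3}.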