Glad you enjoyed the work and thank you for the comment! Here are my thoughts on what you wrote:
I don’t quite understand how the “California Attack” is evidence that understanding the “forbidden fact” behavior mechanistically is intractable.
This depends on your definition of “understanding” and your definition of “tractable”. If we take “understanding” to mean the ability to predict some non-trivial aspects of behavior, then you are entirely correct that approaches like mech-interp are tractable, since in our case it was mechanistic analysis that led us to predict and subsequently discover the California Attack[1].
However, if we define “understanding” as “having a faithful[2] description of behavior to the level of always accurately predicting the most-likely next token” and “tractability” as “the description fitting within a 100 page arXiv paper”, then I would say that the California Attack is evidence that understanding the “forbidden fact” behavior is intractable. This is because the California Attack is actually quite finicky—sometimes it works and sometimes it doesn’t, and I don’t believe one can fit in 100 pages the rule that determines all the cases in which it works and all the cases in which it doesn’t.
Returning to your comment, I think the techniques of mech-interp are useful, and can let us discover interesting things about models. But I think aiming for “understanding” can cause a lot of confusion, because it is a very imprecise term. Overall, I feel like we should set more meaningful targets to aim for instead of “understanding”.
Though the exact bit that you quoted there is somewhat incorrect (this was an error on our part). The explanation in this blogpost is more accurate: “some of these heads would down-weight anything they attended to, and could be made to spuriously attend to words which were not the forbidden word”. We actually just performed an exhaustive search for the embedding vectors that these heads would pay the most attention to, and used those to construct an attack. We have amended the newest version of the paper (just updated yesterday) to reflect this.
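As a rough sketch of what that search looks like (with made-up weight names and toy random matrices, not the paper's actual code), one can score every vocabulary embedding by the pre-softmax attention logit it would receive from a given head's query, and take the top scorers as candidate attack tokens:

```python
import numpy as np

def top_attention_tokens(W_Q, W_K, query_vec, embeddings, k=5):
    """Return indices of the k embeddings a head's query attends to most.

    W_Q, W_K:   the head's query/key projections (d_model x d_head)
    query_vec:  residual-stream vector at the query position (d_model,)
    embeddings: token embedding matrix (vocab_size x d_model)
    """
    q = query_vec @ W_Q        # project the query into the head's space
    keys = embeddings @ W_K    # project every token embedding into key space
    scores = keys @ q          # pre-softmax attention logits, one per token
    return np.argsort(scores)[::-1][:k]

# Toy example with random weights, just to show the shape of the search
rng = np.random.default_rng(0)
d_model, d_head, vocab = 16, 4, 100
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
E = rng.normal(size=(vocab, d_model))
query = rng.normal(size=d_model)
print(top_attention_tokens(W_Q, W_K, query, E))
```

Since the scoring is a single matrix product over the whole vocabulary, the “exhaustive” search is cheap to run per head.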
By faithfulness, I mean a description that matches the actual behavior of the phenomenon. This is similar to the definition given in arxiv.org/abs/2211.00593. This is also not a very precisely defined term, because there is wiggle room. For example, is the float16 version of a network a faithful description of the float32 version of the network? For AI safety purposes, I feel like faithfulness should at least capture behaviors like the California Attack and phenomena like jailbreaks and prompt injections.
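To make the float16 question concrete, here is a toy illustration (contrived values, not from any real model): two logits that are distinct in float32 can collapse to a tie after rounding to float16, changing which token argmax selects.

```python
import numpy as np

# Two logits whose gap is far below float16's resolution near 1.0
# (spacing ~0.00098), so both round to exactly 1.0 in float16.
logits32 = np.array([1.0001, 1.0002], dtype=np.float32)
logits16 = logits32.astype(np.float16)

print(np.argmax(logits32))  # the float32 network prefers token 1
print(np.argmax(logits16))  # after rounding it's a tie; argmax returns 0
```

So even this seemingly innocuous compression can change the most-likely next token, which is why it is unclear whether the lower-precision copy counts as a faithful description.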