A key question is whether behaviors of interest in these large scale settings are tractable to study.
We provide some evidence in the negative, and show that even simple word suppression in Llama-2 may be computationally irreducible. Our evidence is the existence of adversarial examples for the word suppression behavior.
Kudos on the well-written paper and post! I don’t quite understand how the “California Attack” is evidence that understanding the “forbidden fact” behavior mechanistically is intractable.
In fact, it seems like the opposite to me. At the end of section 3 of the paper, you examine attention patterns of suppressor heads and find that they exhibit “key semantic specificity, … [privileging] correct answers to the factual recall over all other keys” (rather than privileging the forbidden word, as one would expect). The “California Attack” then comes out of this mechanistic insight: the understanding of suppressor-head attention patterns informs the adversarial attack and predicts the resulting behavior. This seems like the opposite of computational irreducibility to me!
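For concreteness, here is a rough sketch (my own, not the authors’ code) of what inspecting a suppressor head’s attention pattern might look like, assuming a TransformerLens-style interface to Llama-2-7b-chat; the prompt is only a paraphrase of the forbidden-fact setup, and the layer/head indices are placeholders rather than the paper’s identified suppressor heads:

```python
from transformer_lens import HookedTransformer

# Load Llama-2-7b-chat through TransformerLens (requires access to the HF weights).
model = HookedTransformer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Paraphrase of a forbidden-fact style prompt (not the paper's exact template).
prompt = (
    "You are an obedient assistant who only answers with a single word. "
    "Do not say the forbidden word California. "
    "The Golden Gate Bridge is in the state of"
)
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

layer, head = 31, 5  # placeholder indices, not the paper's identified suppressor heads

# Attention pattern for this head; inspect which keys the final query position attends to.
pattern = cache["pattern", layer][0, head, -1]  # [key_pos]
str_tokens = model.to_str_tokens(prompt)
top_vals, top_idx = pattern.topk(5)
for v, i in zip(top_vals.tolist(), top_idx.tolist()):
    print(f"{str_tokens[i]!r:>15}  attention = {v:.3f}")
```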
The attention analysis and the attack both serve as good evidence that the model uses this heuristic. Faulty or not, it is a heuristic the model actually uses, and knowing this gives us insight into how the model is (imperfectly) performing the task mechanistically.
Glad you enjoyed the work and thank you for the comment! Here are my thoughts on what you wrote:
I don’t quite understand how the “California Attack” is evidence that understanding the “forbidden fact” behavior mechanistically is intractable.
This depends on your definition of “understanding” and your definition of “tractable”. If we take “understanding” to mean the ability to predict some non-trivial aspects of behavior, then you are entirely correct that approaches like mech-interp are tractable, since in our case it was mechanistic analysis that led us to predict and subsequently discover the California Attack[1].
However, if we define “understanding” as “having a faithful[2] description of behavior to the level of always accurately predicting the most-likely next token” and “tractability” as “the description fitting within a 100-page arXiv paper”, then I would say that the California Attack is evidence that understanding the “forbidden fact” behavior is intractable. This is because the California Attack is actually quite finicky: sometimes it works and sometimes it doesn’t, and I don’t believe one can fit into 100 pages the rule that determines exactly when it works and when it doesn’t.
Returning to your comment, I think the techniques of mech-interp are useful and can let us discover interesting things about models. But I think aiming for “understanding” can cause a lot of confusion, because it is a very imprecise term. Overall, I feel like we should aim for more meaningful targets than “understanding”.
[1] Though the exact bit that you quoted there is kind of incorrect (this was an error on our part). The explanation in this blogpost is more correct: “some of these heads would down-weight anything they attended to, and could be made to spuriously attend to words which were not the forbidden word”. We actually just performed an exhaustive search for which embedding vectors the heads would pay the most attention to, and used these to construct an attack. We have since amended the newest version of the paper (just updated yesterday) to reflect this.
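For illustration, here is a rough sketch of that kind of search. This is not our exact code: it assumes a TransformerLens-style interface, uses placeholder layer/head indices, and ignores RMSNorm, rotary position embeddings, and everything earlier layers add to the residual stream, so it only approximates the attention computation.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
layer, head = 31, 5  # placeholder indices, not the actual suppressor heads

# Take the head's query vector at the final position of a reference prompt.
prompt = "Do not say the forbidden word California. The Golden Gate Bridge is in the state of"
_, cache = model.run_with_cache(model.to_tokens(prompt))
q = cache["q", layer][0, -1, head]  # [d_head]

# Approximate keys for every vocabulary embedding (crude: skips normalization
# and positional information, so this is only a first-pass search).
keys = model.W_E @ model.W_K[layer, head] + model.b_K[layer, head]  # [d_vocab, d_head]

# Pre-softmax attention scores of the final query against each hypothetical key;
# the top-scoring tokens are candidates for splicing into the prompt.
scores = keys @ q / model.cfg.d_head ** 0.5
top_vals, top_idx = scores.topk(10)
print([model.to_single_str_token(int(i)) for i in top_idx])
```

In practice one would then splice the top-scoring tokens into the prompt and check whether they actually trigger the spurious suppression behavior.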
[2] By faithfulness, I mean a description that matches the actual behavior of the phenomenon. This is similar to the definition given in arxiv.org/abs/2211.00593. This is also not a very precisely defined term, because there is wiggle room. For example, is the float16 version of a network a faithful description of the float32 version of the network? For AI safety purposes, I feel like faithfulness should at least capture behaviors like the California Attack and phenomena like jailbreaks and prompt injections.