Do you think there should be a preference to the whether one patches clean --> corrupt or corrupt --> clean?
Both of these show slightly different things. Imagine an “AND circuit” where the result is only correct if two attention heads are clean. If you patch clean->corrupt (inserting a clean attention head activation into a corrupt prompt), you will not find this, because the other head is still corrupt; but you do find it if you patch corrupt->clean. The opposite applies for a kind of “OR circuit”. Historically I had more success with corrupt->clean, so I teach this as the default; however, Neel Nanda’s tutorials usually start the other way around, and really you should check both. We basically ran all plots with both patching directions and later picked the ones that contained all the information.
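For concreteness, here is a minimal sketch of what the two directions look like with TransformerLens. The model, prompt pair, and head below are placeholders (not the setup from our work); the only point is that the same hook either inserts a clean activation into a corrupt run or a corrupt activation into a clean run.

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model, not the one from the post

# Placeholder clean/corrupt prompt pair that tokenizes to the same length.
clean_tokens = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt_tokens = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")

# Cache every activation on both prompts.
_, clean_cache = model.run_with_cache(clean_tokens)
_, corrupt_cache = model.run_with_cache(corrupt_tokens)

def run_with_head_patched(run_tokens, source_cache, layer, head):
    """Run `run_tokens`, overwriting one head's output with the activation
    cached from the other prompt (z has shape [batch, pos, head, d_head])."""
    hook_name = utils.get_act_name("z", layer)

    def patch_hook(z, hook):
        z[:, :, head, :] = source_cache[hook.name][:, :, head, :]
        return z

    return model.run_with_hooks(run_tokens, fwd_hooks=[(hook_name, patch_hook)])

layer, head = 0, 2  # e.g. head 0.2

# clean -> corrupt: run the corrupt prompt, insert the clean head activation.
logits_clean_into_corrupt = run_with_head_patched(corrupt_tokens, clean_cache, layer, head)

# corrupt -> clean: run the clean prompt, insert the corrupt head activation.
logits_corrupt_into_clean = run_with_head_patched(clean_tokens, corrupt_cache, layer, head)
```

In the AND-circuit case above, only the corrupt->clean call would move the metric for each individual head.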
Did you find that the selection of [the corrupt words] mattered?
Yes! We tried to select equivalent words so as not to pick up on properties of the specific words, but in fact there was an example where we got confused by this: at some point we wanted to patch param and naively replaced it with arg, not realizing that param is treated specially! Here is a plot of head 0.2's attention pattern; it behaves differently for certain tokens. Another example is the self token: it is treated very differently to the variable name tokens.
So it definitely matters. If you want to focus on a specific behavior, you probably want to pick equivalent tokens to avoid mixing other effects into your analysis.
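As a concrete sanity check (a sketch, not code from our project, using gpt2 and made-up prompts as stand-ins): before patching, it is worth verifying that the clean and corrupt prompts tokenize to the same length and differ only at the positions you intended, so the replacement token really is equivalent.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model

def check_prompt_pair(clean_prompt: str, corrupt_prompt: str):
    """Print where the two prompts differ after tokenization, so you can confirm
    the corrupt prompt only swaps the intended word."""
    clean_tokens = model.to_tokens(clean_prompt)[0]
    corrupt_tokens = model.to_tokens(corrupt_prompt)[0]
    if clean_tokens.shape != corrupt_tokens.shape:
        print("Prompts tokenize to different lengths; positions will not line up.")
        return
    clean_strs = model.to_str_tokens(clean_prompt)
    corrupt_strs = model.to_str_tokens(corrupt_prompt)
    for pos in (clean_tokens != corrupt_tokens).nonzero().flatten().tolist():
        print(f"pos {pos}: {clean_strs[pos]!r} -> {corrupt_strs[pos]!r}")

# Hypothetical example: swap one variable name for another and check alignment.
check_prompt_pair(
    "def f(value, x): return value + x",
    "def f(count, x): return count + x",
)
```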