After adding “from California”, the 7b model responds instead with the incorrect answer “Bay”.
That behavior makes no sense to me (other than that ‘Bay’ and ‘California’ are related concepts, so their embeddings presumably have some inner product), which supports your claim that the word-suppression mechanism’s implementation in Llama-7B is an incoherent mess. What I’m wondering is whether the one in Llama-70B, while doubtless larger, might be more sensibly designed and thus actually easier to interpret.
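For what it’s worth, that inner-product hunch is easy to check directly against the embedding matrix. A minimal sketch, assuming Hugging Face transformers and access to the Llama-2-7b-chat weights; the leading-space handling and the averaging over sub-tokens are just a convenience, not anything from the paper:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # gated weights; assumes you have access
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)

emb = model.get_input_embeddings().weight  # [vocab_size, d_model]

def word_embedding(word: str) -> torch.Tensor:
    # Embed the word as it would appear mid-sentence; average if it
    # splits into several sub-tokens.
    ids = tokenizer(" " + word, add_special_tokens=False).input_ids
    return emb[ids].float().mean(dim=0)

bay, california = word_embedding("Bay"), word_embedding("California")
cos = torch.nn.functional.cosine_similarity(bay, california, dim=0)
print(f"cosine similarity('Bay', 'California') = {cos.item():.3f}")
```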
We also need to bear in mind that your game, of giving a one-word answer when one value is forbidden, is (designed to be) very simple in its effects on tokens, but the model probably hasn’t seen it often before. So the mechanism you’re interpreting is probably intended for implementing more complex behaviors, perhaps along the lines of “certain words and phrases aren’t allowed to be used in certain circumstances, but you still have to make sense”. So a certain amount of complexity in its implementation, including overlap with “similar” words, seems unsurprising. Also, forbidding “California” is a really odd thing to do; might the mechanism work better if you forbade a word closer to the things that are often forbidden in some contexts?
Our best guess is that “Bay” is the second-most-likely answer (after “California”) to the factual-recall question “The Golden Gate Bridge is in the state of ”. Indeed, when running our own version of Llama-2-7b-chat, adding “from California” results in “San Francisco” being output instead of “Bay”. As you can see in this notebook, “San Francisco” is the second-most-likely answer for our setup. replicate.com behaves differently from our local version of Llama-2-7b-chat, though, and we were not able to figure out how to match its behavior.
The “second-most-likely answer” theory is also not perfect, since it is possible to attack the Replicate model into outputting “San Francisco”, e.g. by forbidding “cat”: https://replicate.com/p/q3qixwdbm6egjmaan3fjfbhywe.
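If you want to reproduce the second-most-likely check locally rather than via the notebook, the gist is just to look at the top next-token logits. A rough sketch, assuming Hugging Face transformers; the bare prompt here is a simplification, since our real setup wraps it in the Llama-2 chat template plus the game instructions:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # gated weights; assumes you have access
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

# Bare prompt for illustration only; the real experiments use the chat
# template plus the "you may not say <word>" instructions.
prompt = "The Golden Gate Bridge is in the state of"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

top = torch.topk(next_token_logits, k=5)
for logit, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([idx.item()])!r:>20}  logit={logit.item():.2f}")
```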
Re your second point: the circuit in Llama-2-70b-chat is not obviously larger than the one in Llama-2-7b-chat. In our paper, we measured 7b to have 35 suppressive components, while 70b has 34 suppressive components. However, since we weren’t able to find attacks for 70b, it may be true that its components are cleaner. Part of the reason we weren’t able to find an attack for 70b is that it is much more annoying to work with (e.g. it requires multiple A100 GPUs to run and it doesn’t have great support in TransformerLens).
Finally, good point about our game being kind of unnatural. My personal take is that the majority of things we are currently asking our LLMs to do are “unnatural” (since they require a large amount of generalization from the training set). This is ultimately an empirical question, and I think an interesting avenue for future work.
Specifically, I am curious whether there are good automated ways of lower-bounding the complexity of circuits. It is impossible to do this well in general (cf. Kolmogorov complexity being uncomputable), but maybe there are good heuristics that work well in practice. Our first-order-patching method is one such heuristic, but it is lacking in the sense that it does not say how interpretable each component is. Perhaps if techniques like ACDC or subnetwork probing are improved, they could give a better sense of circuit complexity.
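To make “first-order patching” a bit more concrete: the idea is to patch one component at a time and measure the effect on the forbidden answer. The sketch below is a simplified illustration in TransformerLens, not our exact pipeline; the prompts, the choice of component, and the use of GPT-2 as a stand-in are all just for demonstration:

```python
import torch
from transformer_lens import HookedTransformer, utils

# GPT-2 small as a stand-in so the sketch runs anywhere; swap in Llama-2-7b-chat
# (and sweep over all layers/heads) for anything real.
model = HookedTransformer.from_pretrained("gpt2")

forbidden_prompt = "You may not say California. The Golden Gate Bridge is in the state of"
baseline_prompt = "You may not say floor. The Golden Gate Bridge is in the state of"
answer_token = model.to_tokens(" California", prepend_bos=False)[0, 0]

forbidden_tokens = model.to_tokens(forbidden_prompt)
baseline_tokens = model.to_tokens(baseline_prompt)
_, baseline_cache = model.run_with_cache(baseline_tokens)

layer, head = 9, 5                          # one illustrative component
hook_name = utils.get_act_name("z", layer)  # per-head attention outputs

def patch_head(z, hook):
    # First-order patch: overwrite this single head's output at the final
    # position with its value from the baseline ("irrelevant word forbidden") run.
    z[:, -1, head, :] = baseline_cache[hook_name][:, -1, head, :]
    return z

unpatched = model(forbidden_tokens)[0, -1, answer_token].item()
patched = model.run_with_hooks(
    forbidden_tokens, fwd_hooks=[(hook_name, patch_head)]
)[0, -1, answer_token].item()

print(f"forbidden-answer logit: {unpatched:.2f} (unpatched) -> {patched:.2f} (patched)")
```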
You could of course quantize Llama-2-70b, say to 6 or 8 bits, so that you can work with it inside a single A100 80GB, but that’s obviously going to apply some fuzz to everything, and probably isn’t something you want to have to footnote in an academic paper. Still, for finding an attack, you could search in a 6-bit quantized version and then confirm the attack works against the full model.
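The 8-bit case is roughly one flag via bitsandbytes in Hugging Face transformers (6-bit would need a different scheme such as GPTQ or ExLlama); a sketch, assuming you’ve accepted the gated-weights license:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL = "meta-llama/Llama-2-70b-chat-hf"  # gated; requires accepting Meta's license

# 8-bit weights via bitsandbytes: ~140 GB of fp16 weights become ~70 GB,
# which just fits on one 80 GB A100 (KV cache and activations take the rest).
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```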
I’m not sure you need to worry that much about uncomputability in something with fewer than 50 layers, but I suppose circuits can get quite large in practice. My hunch is that this particular one actually extends from about layer 16 (the midpoint of the model) to about layers 20–21 (where the big jumps in divergence between refusal and answering happen: I’d guess that’s a “final decision”).
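To be concrete about what I mean by watching where the decision happens: a logit-lens pass over the residual stream shows at which layer the forbidden answer’s logit collapses. A rough sketch (this is not the paper’s divergence metric, and GPT-2 here is only a stand-in for Llama-2-7b-chat):

```python
from transformer_lens import HookedTransformer

# GPT-2 small as a stand-in so the sketch runs anywhere; swap in Llama-2-7b-chat
# to look at the layers discussed above.
model = HookedTransformer.from_pretrained("gpt2")

prompt = "You may not say California. The Golden Gate Bridge is in the state of"
answer = model.to_tokens(" California", prepend_bos=False)[0, 0]

tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

# Logit-lens: project each layer's residual stream (at the final position)
# through the final LayerNorm and unembedding, and watch where the forbidden
# answer's logit drops off.
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer]                    # [batch, pos, d_model]
    logits = model.unembed(model.ln_final(resid))[0, -1]
    print(f"after layer {layer:2d}: forbidden-answer logit = {logits[answer].item():+.2f}")
```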