Our best guess is that “Bay” is the second-most-likely answer (after “California”) to the factual recall question “The Golden Gate Bridge is in the state of ”. Indeed, when running our own version of Llama-2-7b-chat, adding “from California” results in “San Francisco” being output instead of “Bay”. As you can see in this notebook, “San Francisco” is the second-most-likely answer in our setup. replicate.com behaves differently from our local version of Llama-2-7b-chat, though, and we were not able to figure out how to match its behavior.
The “second-most-likely” theory is also not perfect, since it is possible to get the replicate model to output “San Francisco” with an unrelated attack, e.g. by forbidding “cat”: https://replicate.com/p/q3qixwdbm6egjmaan3fjfbhywe.
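If you want to sanity-check the “second-most-likely” claim locally, here is a rough sketch (not the linked notebook) of how to inspect the next-token distribution for the bare prompt with Hugging Face transformers. The model id, dtype, and top-k are placeholders, and applying the chat template / system prompt will shift the ranking:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rough sketch, not the linked notebook: model id, dtype, and k are placeholders.
name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

prompt = "The Golden Gate Bridge is in the state of"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the token right after the prompt
probs = torch.softmax(next_token_logits, dim=-1)
top = torch.topk(probs, k=5)
for p, tok_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(tok_id.item())!r}: {p.item():.3f}")
```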
Re your second point: the circuit in Llama-2-70b-chat is not obviously larger than the one in Llama-2-7b-chat. In our paper, we measured 7b to have 35 suppressive components, while 70b has 34. However, since we weren’t able to find attacks for 70b, it may be that its components are cleaner. Part of the reason we weren’t able to find an attack for 70b is that it is much more annoying to work with (e.g. it requires multiple A100 GPUs to run, and it doesn’t have great support in TransformerLens).
Finally, good point about our game being kind of unnatural. My personal take is that the majority of things we are currently asking our LLMs to do are “unnatural” (since they require a large amount of generalization from the training set). This is ultimately an empirical question, and I think it’s an interesting avenue for future work.
Specifically, I am curious whether there are good automated ways of lower-bounding the complexity of circuits. It is impossible to do this well in general (cf. Kolmogorov complexity being uncomputable), but maybe there are good heuristics that work well in practice. Our first-order-patching method is one such heuristic, but it is lacking in the sense that it does not say how interpretable each component is. Perhaps if techniques like ACDC or subnetwork probing are improved, they could give a better sense of circuit complexity.
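To make that heuristic a bit more concrete, here is a very rough sketch of the idea behind first-order patching as a complexity count, written with TransformerLens against a small stand-in model rather than our actual setup. The prompts, metric, and threshold are all illustrative, not the procedure or numbers from the paper:

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # small stand-in; the paper uses Llama-2 chat models

clean_prompt = "The Golden Gate Bridge is in the state of"
corrupt_prompt = "The Eiffel Tower is in the state of"
target = model.to_single_token(" California")

_, clean_cache = model.run_with_cache(clean_prompt)

def logit_with_head_patched(layer, head):
    """Run the corrupted prompt, splicing in one head's clean activation at the last position."""
    def patch_head(z, hook):
        # z: [batch, pos, head_index, d_head]; only the final position is patched,
        # so the two prompts don't need to be the same length.
        z[:, -1, head] = clean_cache[hook.name][:, -1, head]
        return z
    logits = model.run_with_hooks(
        corrupt_prompt,
        fwd_hooks=[(utils.get_act_name("z", layer), patch_head)],
    )
    return logits[0, -1, target].item()

baseline = model(corrupt_prompt)[0, -1, target].item()
important = [
    (layer, head)
    for layer in range(model.cfg.n_layers)
    for head in range(model.cfg.n_heads)
    if abs(logit_with_head_patched(layer, head) - baseline) > 0.5  # arbitrary threshold
]
print(f"{len(important)} heads matter under this crude first-order criterion")
```

As the comment in the prose says, a count like this tells you how many components move the metric, but nothing about how interpretable each one is.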
You could of course quantize Llama-2-70b, say to 6 or 8 bits, so that it fits on a single A100 80GB, but that’s obviously going to apply some fuzz to everything, and probably isn’t something you want to have to footnote in an academic paper. Still, for finding an attack, you could search in a 6-bit quantized version and then confirm the attack works against the full model.
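If someone wants to try that, a minimal sketch of an 8-bit load via bitsandbytes might look like the following (the hub id and settings are my guesses, and 6-bit would need a different backend, e.g. a GGUF/llama.cpp quant):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Minimal sketch: load the 70b chat model in 8-bit so it (just about) fits on a
# single A100 80GB. Hub id and settings are assumptions, not the paper's setup.
name = "meta-llama/Llama-2-70b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(
    name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # place layers on the GPU, offloading to CPU if they don't fit
)
tokenizer = AutoTokenizer.from_pretrained(name)
```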
I’m not sure you need to worry that much about uncomputability in something with fewer than 50 layers, but I suppose circuits can get quite large in practice. My hunch is that this particular one actually extends from about layer 16 (the midpoint of the model) to about layers 20-21 (where the big jumps in divergence between refusal and answering happen; I’d guess that’s a “final decision”).
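One cheap way to eyeball where that divergence shows up, sketched with TransformerLens on a small stand-in model: apply the logit lens to the residual stream after each layer for an “answer” run and a “refuse” run and compare the implied next-token distributions. The prompts and the KL metric below are my own placeholders, not whatever the paper actually plots:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # small stand-in for illustration

answer_prompt = "Q: Name a primary color. A:"
refuse_prompt = "Q: Name a primary color, but refuse to answer. A:"

_, cache_answer = model.run_with_cache(answer_prompt)
_, cache_refuse = model.run_with_cache(refuse_prompt)

for layer in range(model.cfg.n_layers):
    # Residual stream after this layer, final position, pushed through the
    # unembedding ("logit lens") to get an implied next-token distribution.
    resid_a = cache_answer["resid_post", layer][:, -1:]
    resid_r = cache_refuse["resid_post", layer][:, -1:]
    logp_a = (model.ln_final(resid_a) @ model.W_U).log_softmax(-1)[0, -1]
    logp_r = (model.ln_final(resid_r) @ model.W_U).log_softmax(-1)[0, -1]
    # KL(refuse || answer): how far the refusal run's distribution is from the answering run's.
    kl = torch.nn.functional.kl_div(logp_a, logp_r, log_target=True, reduction="sum")
    print(f"layer {layer:2d}: KL(refuse || answer) = {kl.item():.3f}")
```

A sharp jump in a curve like this between adjacent layers is the sort of evidence I’d read as a “final decision” point.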