As with the CCS post, I’m reviewing both the paper and the post, though the majority of the review is on the paper. I’m writing this quickly (total time on review: ~1.5h), but I expect to be willing to defend the points made here.
There are a lot of reasons I like the work. It’s an example of:
Actually poking inside a real model. A lot of the mech interp work in early-mid 2022 was focused on getting a deep understanding of toy models trained on algorithmic tasks (at least in this community).[1] There was some effort at Redwood to do neuron-by-neuron replacement, and Nix completed his work on the parentheses balancer concurrently to the IOI results, but insofar as there was mech interp work being done, most of it was on simple models such as the ones featured in Toy Models of Superposition or Modular Arithmetic Grokking (with the primary exception being the Induction Head results from Anthropic, which are substantively weaker outside of very small transformers).
This work was one of the first attempts to explain a particular, nontrivial behavior inside of a small but real LM (GPT-2-small).
Demonstrating the feasibility of patching and circuit-based analysis on language models. I think it’s notable that this work doesn’t just mechanistically study behavior inside of a language model; it finds a circuit (a small subgraph) implementing the behavior. This is valuable both as a confirmation that patching can be used to find circuits in “real” models and as evidence that we can find these circuits at all. In turn, this has led to a veritable explosion of “poking LLMs with various kinds of patching/scrubbing to identify subgraphs of particular behaviors”, which I think has been pretty valuable on net. (A rough sketch of what such a patching intervention looks like is included after this point.)
Also, as Neel says below, it’s important for pedagogical reasons.
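To make the above concrete, here is a minimal, hypothetical sketch of the kind of patching intervention involved; it uses a toy PyTorch module rather than GPT-2-small and is not the paper’s actual code. The idea: cache an activation from a run on a “corrupted” input, splice it into a run on a “clean” input, and see how much the output changes.

```python
# Toy activation-patching sketch (illustrative only; not the paper's setup,
# which patches attention heads in GPT-2-small on the IOI distribution).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical two-layer model standing in for one model component.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))

clean_input = torch.randn(1, 8)
corrupted_input = torch.randn(1, 8)

# 1. Cache the intermediate activation (output of layer 0) on the corrupted run.
cache = {}
def save_hook(module, inputs, output):
    cache["act"] = output.detach()

handle = model[0].register_forward_hook(save_hook)
model(corrupted_input)
handle.remove()

# 2. Re-run on the clean input, but patch in the cached corrupted activation.
#    (In a real model you would patch only one head/layer out of many.)
def patch_hook(module, inputs, output):
    return cache["act"]

handle = model[0].register_forward_hook(patch_hook)
patched_logits = model(clean_input)
handle.remove()

# 3. Compare against the unpatched clean run: a large change in the output is
#    evidence that the patched component matters for the behavior under study.
clean_logits = model(clean_input)
print("clean:  ", clean_logits)
print("patched:", patched_logits)
```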
Field-building via example. As with the Modular Arithmetic work by Neel, this was published at ICLR ’23 as the joint-first mech interp work to be published at a top conference. This helped build a substantial amount of legitimacy and academic interest for the field of mech interp (and, more broadly, AI x-risk-flavored interp in general).
Demonstrating failure modes and limitations of mech interp techniques. As stated in this post, an earlier version of this work used mean ablation in a way that preserved “information that helped compute the task”, which incorrectly suggested that parts of the circuit were unimportant for performance. It’s a concrete example of why it’s important to think about what exactly you’re ablating, and how your ablation serves as a valid test of your hypothesis. (A sketch contrasting two ablation choices follows after this point.)
This work also directly inspired Causal Scrubbing, which was an attempt to more completely remove information that helps complete a task.
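As a rough, hypothetical illustration of the distinction at play (toy tensors, not the paper’s actual ablation procedure): mean ablation replaces an activation with its average over some reference distribution, while resample ablation (closer in spirit to Causal Scrubbing) replaces it with the activation from a randomly chosen reference example. Which reference distribution you average or resample over determines how much task-relevant information the ablation actually removes.

```python
# Illustrative ablation sketch (hypothetical shapes and data, not the paper's code).
import torch

torch.manual_seed(0)

# Hypothetical cached activations for one attention head: (n_examples, d_head).
acts_task = torch.randn(64, 16) + 1.0  # activations on the task distribution
acts_reference = torch.randn(64, 16)   # activations on a reference distribution

def mean_ablate(acts, reference_acts):
    # Replace each activation with the mean over the reference distribution.
    # If the reference shares structure with the task (e.g. the same templates),
    # this mean can still carry information that helps compute the task.
    return reference_acts.mean(dim=0, keepdim=True).expand_as(acts)

def resample_ablate(acts, reference_acts):
    # Replace each activation with one drawn from a random reference example.
    idx = torch.randint(len(reference_acts), (acts.shape[0],))
    return reference_acts[idx]

print(mean_ablate(acts_task, acts_reference).shape)      # torch.Size([64, 16])
print(resample_ablate(acts_task, acts_reference).shape)  # torch.Size([64, 16])
```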
Validating interp via adversarial inputs. I appreciate the use of adversarial example discovery as a downstream use case of the interp.
But there are also some reservations I have:
Some of the presentation was misleading. Originally, the paper defined the IOI task as something along the lines of: ’… sentences like “When John and Mary went to the store, John gave a drink to” should be completed with “Mary”.’ That is, it did not make it clear that IOI was about assigning a higher logit to “Mary” than to “John”, and not about assigning an (absolutely) high logit to “Mary”. IIRC, this was only clarified near the end thanks to the effort of one of the critical ICLR reviewers.[2] There were also other strong claims that were significantly ameliorated by the ICLR review process.[3]
The circuit is likely overfit to the metric. I think that the mean logit difference is indeed the correct metric to look at, both because of how the task was defined and also for many use cases in general.[4] However, it’s worth noting that this circuit does not hold up well if we replace the mean logit difference with other superficially similar metrics, e.g. the mean absolute logit difference (i.e. E[|logit_diff_model - logit_diff_subgraph|]).
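As a toy illustration (made-up numbers, not results from the paper) of why these two metrics can disagree: per-example discrepancies between the model’s and the circuit’s logit differences can cancel out under the mean, but not under the mean absolute difference.

```python
# Hypothetical numbers illustrating mean vs. mean-absolute logit difference.
import torch

torch.manual_seed(0)

# Per-example logit differences, logit("Mary") - logit("John"), for the full
# model and for the ablated circuit/subgraph (both made up here).
logit_diff_model = torch.randn(1000) + 3.0
# Suppose the circuit's per-example logit diffs deviate a lot from the model's,
# but the deviations roughly cancel in aggregate.
logit_diff_circuit = logit_diff_model + 2.0 * torch.randn(1000)

# Comparing the means makes the circuit look faithful...
print("model mean logit diff:  ", logit_diff_model.mean().item())
print("circuit mean logit diff:", logit_diff_circuit.mean().item())

# ...while the mean absolute per-example deviation,
# E[|logit_diff_model - logit_diff_subgraph|], is large because the errors no
# longer cancel.
print("mean |model - circuit|: ",
      (logit_diff_model - logit_diff_circuit).abs().mean().item())
```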
The circuit is likely incomplete. Running Causal Scrubbing on the hypothesis suggests that it is importantly incomplete, see for example Alexandre’s comment below. The incompleteness of the circuit also suggests some limitations of node-based causal interventions (i.e. activation patching in this case), as previously discussed. That being said, this wasn’t something that could’ve really been known when the experiments were being done for this paper, as Causal Scrubbing was inspired by these results (and thus could not have been used to generate them).
And there are two big points that I’m very, very torn on (they have less to do with the work itself than with general approaches to/issues with mech interp):
Using an algorithmic task (IOI). As this post says, it’s an example of “streetlight interpretability”: looking at cases that are easy, as opposed to ones that are useful or realistic. I think it’s valuable to do some amount of streetlight interpretability, and it’s especially understandable in the case of this work (as one of the earliest mech interp pieces), but I do think that this is a weakness of the work. I also think that the fact that seminal works in mech interp used algorithmic tasks may have contributed to a lack of attention paid to soft heuristics/memorization/n-gram-statistic-style behavior inside of models, which I think are quite neglected.
Low percent performance recovered. While the headline numbers for completeness/faithfulness are pretty high in terms of percentage, this actually is quite bad in terms of downstream performance. This isn’t specific to this work. But, to use causal scrubbing as an example, if random performance on a task is 10 nats of log loss and the model’s performance is 2.1 nats, an explanation that recovers performance down to 2.6 nats might give the impressive-sounding number of 93.7% loss recovered. But in practice, 2.6 nats might be the performance of a model 1/100 or 1/1000 the size of the model we’re trying to explain. If the behavior you’re trying to explain is present in the most capable models but not in models a generation or two back, this kind of result does not provide significant evidence that explaining it is possible. Again, this isn’t specific to this work, but applies to circuit-style mech interp on real models in general.
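For reference, here is the arithmetic behind the 93.7% figure above, using one common way of computing “loss recovered” (the numbers are copied from the hypothetical example, not from any particular paper):

```python
# Hypothetical "loss recovered" calculation (numbers from the example above).
loss_random = 10.0     # nats, random-chance baseline
loss_model = 2.1       # nats, full model
loss_explained = 2.6   # nats, model under the scrubbed/ablated hypothesis

recovered = (loss_random - loss_explained) / (loss_random - loss_model)
print(f"{recovered:.1%}")  # -> 93.7%
```

The point being that the remaining 0.5 nats can correspond to a large gap in effective model capability, even though the percentage looks high.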
I think the post itself is pretty good though not exceptional. I appreciate the explanation of how the task and approach were chosen, as well as the key takeaway that causal interventions can be powerful for mech interp if they are performed appropriately, but that doing them appropriately is challenging.
All said, I’m giving this a 4 on the annual review.
Note that there was plenty of non-mechanistic interp work that looked at real models and tasks; in fact, the majority of interp has always been on non-toy models and tasks. But mech interp was focused on toy tasks.
I helped out with rebuttals on this paper, and was honestly impressed by the two critical reviews posted by reviewer jy1a (official review, response to author rebuttal), who, among other (imo correct) points, correctly pointed out that the paper was using this incorrect definition of IOI. Notice also how, in the rebuttal response, they point out the issue of using mean logit difference versus mean absolute logit difference. I think that (alongside the RR AT paper) this was one of the reasons I updated to be more in favor of the existing academic peer review system.
See e.g. this comment from the Program Chairs:
The major concerns from the reviewers are the current limited limitation section and a few not-well supported/overstated claims in the paper. Request to the authors: Please update the paper to have a stronger and more critical limitation discussion, as well as substantially change the writing to justify all claims/assumptions (or not to overstate claims) in order to reflect reviewers’ comments.
The main reason is that we don’t really care about ‘noise’ when explaining good performance, e.g. from the Causal Scrubbing appendix:
Suppose that one of the drivers of the model’s behavior is noise: trying to capture the full distribution would require us to explain what causes the noise. For example, you’d have to explain the behavior of a randomly initialized model despite the model doing ‘nothing interesting’.
That being said, this claim depends greatly on the implied downstream use case of interp. E.g. if the goal is to understand failure modes, then explaining just the success is insufficient.