I really like this paper! This is one of my favourite interpretability papers of 2022, and has substantially influenced my research. I voted at 9 in the annual review. Specific things I like about it:
It really started “narrow distribution”-focused interpretability: examining models only on sentences of the form “John and Mary went to the store, John gave a bag to” → ” Mary”. IMO this is a promising alternative focus to the “understand what model components mean on the full data distribution” mindset, and worth some real investment. Model components often do many things in different contexts (ie are polysemantic), and narrow distributions allow us to ignore their other roles.
This is less ambitious and less useful than full distribution interp, but may be much easier, and still sufficient for useful applications of interp like debugging model failures (eg why does BingChat gaslight people) or creating adversarial examples.
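For concreteness, here’s roughly what evaluating a model on this narrow distribution looks like, as a minimal sketch using the TransformerLens library and the paper’s logit-difference metric (the specific code is my own illustration, not the paper’s):

```python
from transformer_lens import HookedTransformer

# GPT-2 small, the model studied in the IOI paper.
model = HookedTransformer.from_pretrained("gpt2")

# One prompt from the narrow IOI distribution: the model should predict the
# indirect object (" Mary") rather than the repeated subject (" John").
prompt = "John and Mary went to the store, John gave a bag to"
tokens = model.to_tokens(prompt)
logits = model(tokens)  # [batch, seq, d_vocab]

# The paper's metric: logit difference between the indirect object and the
# subject at the final position. Positive means the model gets it right.
io_token = model.to_single_token(" Mary")
s_token = model.to_single_token(" John")
logit_diff = logits[0, -1, io_token] - logits[0, -1, s_token]
print(f"logit diff (IO - S): {logit_diff.item():.2f}")
```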
It really pushed forwards causal intervention based mech interp (ie activation patching), rather than “analysing weights”-based mech interp. Causal interventions are inherently distribution-dependent and in some sense less satisfying, but much more scalable, and an important tool in our toolkit (eg they kinda just worked on Chinchilla 70B).
Patching was not original to IOI, but IOI was the first time I saw someone actually try to use it to uncover a circuit.
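To make the technique concrete, a minimal activation patching sketch in TransformerLens: cache activations on a clean run, re-run on a corrupted prompt while overwriting one head’s output with its clean value, and check how much of the clean behaviour is restored. The corruption and the choice of head 9.9 (one of the paper’s name mover heads) are mine, for illustration:

```python
from functools import partial

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "John and Mary went to the store, John gave a bag to"
# Illustrative corruption: swap the repeated name, flipping the answer.
corrupt_prompt = "John and Mary went to the store, Mary gave a bag to"
clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

# Cache every activation on the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

def patch_head(z, hook, head):
    # z: [batch, pos, head_index, d_head]. Overwrite one head's output
    # with its value from the clean run.
    z[:, :, head] = clean_cache[hook.name][:, :, head]
    return z

# Run on the corrupted prompt, patching in head 9.9 from the clean run.
layer, head = 9, 9
patched_logits = model.run_with_hooks(
    corrupt_tokens,
    fwd_hooks=[(f"blocks.{layer}.attn.hook_z", partial(patch_head, head=head))],
)

# How much clean behaviour did the patch restore? Measure the logit
# difference between " Mary" and " John" at the final position.
io, s = model.to_single_token(" Mary"), model.to_single_token(" John")
print((patched_logits[0, -1, io] - patched_logits[0, -1, s]).item())
```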
It was the first place I saw edge/path patching, which is a cool and important innovation on the technique. It’s a lot easier to interpret a set of key nodes and how they connect up than just heads that matter in isolation.
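And a sketch of the simplest form of edge patching, as I think of it: heads write linearly into the residual stream, so patching the direct edge from a sender head to a receiver head’s query amounts to swapping the sender’s contribution in just the receiver’s query input. The edge here (S-inhibition head 8.6 → name mover 9.9) and the code are my own illustration; the paper’s full path patching also handles indirect paths:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
# Give each head its own query/key/value input, so one head's query input
# can be hooked separately.
model.set_use_split_qkv_input(True)

clean = model.to_tokens("John and Mary went to the store, John gave a bag to")
corrupt = model.to_tokens("John and Mary went to the store, Mary gave a bag to")

_, clean_cache = model.run_with_cache(clean)
_, corrupt_cache = model.run_with_cache(corrupt)

sender_layer, sender_head = 8, 6      # an S-inhibition head in the paper
receiver_layer, receiver_head = 9, 9  # a name mover head

def head_resid_out(cache, layer, head):
    # The head's contribution to the residual stream: z @ W_O for that head.
    z = cache[f"blocks.{layer}.attn.hook_z"][:, :, head]  # [batch, pos, d_head]
    return z @ model.W_O[layer, head]                     # [batch, pos, d_model]

# Difference between the sender's corrupted and clean contributions.
delta = (head_resid_out(corrupt_cache, sender_layer, sender_head)
         - head_resid_out(clean_cache, sender_layer, sender_head))

def patch_edge(q_input, hook):
    # q_input: [batch, pos, head_index, d_model]. Only the receiver head's
    # query input changes; every other path stays on the clean run.
    q_input[:, :, receiver_head] += delta
    return q_input

patched_logits = model.run_with_hooks(
    clean,
    fwd_hooks=[(f"blocks.{receiver_layer}.hook_q_input", patch_edge)],
)
io, s = model.to_single_token(" Mary"), model.to_single_token(" John")
print((patched_logits[0, -1, io] - patched_logits[0, -1, s]).item())
```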
It’s really useful to have an example of a concrete circuit when reasoning through mech interp! I often use IOI as a standard example when teaching or thinking through something.
When you go looking inside a model you see weird phenomena, which are valuable to know about in future. The role of this kind of work is to give existence proofs of these phenomena, so a single example is sufficient.
It was the first place I saw the phenomenon of backup/self-repair, which I found highly unexpected.
It was the first place I saw negative heads (which directly led to the copy suppression paper I supervised, one of my favourite interp papers of 2023!)
It’s led to a lot of follow-up work trying to uncover different circuits. I think this line of research is hitting diminishing returns, but I’d still love to eg have a zoo of at least 10 circuits in small-to-medium language models!
This was the joint first mech interp work published at a top ML conference, which seems like solid field-building, with more than 100 citations in the past 14 months!
I personally didn’t find the paper that easy to read, and tend to recommend people read other resources to understand the techniques used; I’d guess it suffered somewhat from trying to conform to peer review norms. But idk, the above is just a lot of impressive things for a single paper!