Buck comments on An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Buck 7 Jul 2024 18:25 UTC
LW: 33 AF: 22
I think people who read A Mathematical Framework should note that its mathematical claim that one-layer transformers are equivalent to skip-trigrams is IMO wrong, and that many people interpret the induction head hypothesis as being much stronger than the evidence supports.
(I think that many other claims in the paper are pretty dubious too, e.g. the stuff about interpreting models as sums of paths, because there is a softmax nonlinearity after these paths. But I have never gotten around to writing this up and probably never will.)
- Neel Nanda 7 Jul 2024 19:32 UTC
  LW: 5 AF: 4
  Fair point, I’ll add that to the post. The main reason I recommend it so highly and prominently is that I think it builds valuable conceptual frameworks for reasoning about the pieces of a transformer, even if it somewhat overclaims on how far it can get on interpreting tiny attention-only models, and I think those broad intuitions still stand even after your critiques. E.g. strict induction heads are a useful example of the kind of algorithm that can be implemented with attention, even if that picture is not fully faithful to the underlying model. But I agree that these are worthwhile caveats to have in mind when reading, and the paper shouldn’t be blindly recommended.
  - Buck 7 Jul 2024 20:19 UTC
    LW: 10 AF: 7
    Thanks! I agree that thinking through the idealized induction head algorithm seems healthy, but I think it seems important to know that that algorithm isn’t much of what those heads are actually doing!
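
For readers who have not seen it spelled out, here is a minimal sketch of the idealized ("strict") induction algorithm discussed in this thread: find the most recent earlier occurrence of the current token and predict whatever followed it. The function name `strict_induction_predict` and the use of string tokens are assumptions made for illustration; this is the textbook rule only, and, per the comments above, it should not be read as a faithful description of what real induction heads compute.

```python
from typing import Optional, Sequence


def strict_induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Idealized induction rule (illustrative, not a model of real heads).

    Given a context ending in token A, look for the most recent earlier
    occurrence of A and return the token that came right after it.
    Returns None when A has not appeared earlier in the context.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions from most recent to oldest.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # tokens[i + 1] always exists because i <= len(tokens) - 2.
            return tokens[i + 1]
    return None
```

For example, on the context `["A", "B", "C", "A"]` the rule looks back to the earlier `"A"` and proposes `"B"` as the next token; on a context with no repeated token it makes no prediction at all.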