Jason Gross comments on An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Jason Gross 11 Jul 2024 0:14 UTC
LW: 4 AF: 2
Progress Measures for Grokking via Mechanistic Interpretability (Neel Nanda et al) - nothing important in mech interp has properly built on this IMO, but there’s just a ton of gorgeous results in there. I think it’s the most (only?) truly rigorous reverse-engineering work out there
Totally agree that this has gorgeous results, and this is what got me into mech interp in the first place! Re “most (only?) truly rigorous reverse-engineering work out there”: I think the Clock and Pizza paper is comparably rigorous. There’s also my recent Compact Proofs of Model Performance via Mechanistic Interpretability (and Gabe’s heuristic analysis of the same Max-of-K model), as well as the work one of my MARS scholars did showing that some pizza models use a ReLU to compute numerical integration. That is the first nontrivial mechanistic explanation of a nonlinearity found in a trained model (nontrivial in the sense that it asymptotically compresses the brute-force input-output behavior with a (provably) non-vacuous bound).
What links here?
- An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 by Neel Nanda (7 Jul 2024 17:39 UTC; 134 points)
- Neel Nanda 11 Jul 2024 9:46 UTC
  2 points
  Thanks! That was copied from the previous post, and I think this is fair pushback, so I’ve hedged the claim to “one of the most”. Does that seem reasonable?
  
  I haven’t engaged deeply enough with those three papers to know whether they meet my bar for recommendation, so I’ve instead linked to your comment from the post.