Progress Measures for Grokking via Mechanistic Interpretability (Neel Nanda et al) - nothing important in mech interp has properly built on this IMO, but there’s just a ton of gorgeous results in there. I think it’s the most (only?) truly rigorous reverse-engineering work out there
Totally agree that this has gorgeous results, and this is what got me into mech interp in the first place! Re “most (only?) truly rigorous reverse-engineering work out there”: I think the clock and pizza paper is comparably rigorous. There’s also my recent Compact Proofs of Model Performance via Mechanistic Interpretability (and Gabe’s heuristic analysis of the same Max-of-K model), plus the work one of my MARS scholars did showing that some pizza models use a ReLU to compute numerical integration. That last result is the first nontrivial mechanistic explanation of a nonlinearity found in a trained model, nontrivial in the sense that it asymptotically compresses the brute-force input-output behavior with a (provably) non-vacuous bound.
Thanks! That was copied from the previous post, and I think this is fair pushback, so I’ve hedged the claim to “one of the most”. Does that seem reasonable?
I haven’t engaged deeply enough with those three papers to know if they meet my bar for recommendation, so I’ve instead linked to your comment from the post.