this line of work is the strongest argument for mech interp [...] having concrete capabilities externalities
I have found this claim a bit handwavy: I can imagine state space models being invented and improved to their current state without the prior work of mech interp. More fundamentally, "being inspired by" is not a quantitative claim, and mech interp is not the central idea there anyway.
On the other hand, much of the (shallower) interp work can help capabilities more directly, especially on inference speed. Recent examples I can think of are Attention Sinks, Activation Sparsity, Deja Vu, and several parallel and follow-up works. (Sudden Drops also has some evidence that insights from developmental interp can improve training dynamics, though I find that evidence somewhat weak.)
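To make the Attention Sinks example concrete: the interp-style observation there is that attention concentrates on the first few tokens, which motivates a KV-cache eviction policy that always keeps those "sink" tokens plus a recent sliding window. A minimal sketch of that policy, under my own naming (this is an illustration of the idea, not code from the paper):

```python
def sink_cache_keep(positions, num_sinks=4, window=8):
    """Select which cached token positions to keep under a
    sink-plus-sliding-window eviction policy.

    positions: token positions currently in the KV cache, oldest first.
    num_sinks: initial "attention sink" tokens that are never evicted.
    window:    most recent tokens kept for local context.
    """
    if len(positions) <= num_sinks + window:
        # Cache still fits; evict nothing.
        return list(positions)
    # Keep the first num_sinks positions and the last `window` positions.
    return list(positions[:num_sinks]) + list(positions[-window:])
```

The capabilities payoff is that the cache stays at a fixed size (`num_sinks + window`) regardless of sequence length, while retaining the sink tokens the model's attention heads actually rely on.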
Yeah, “strongest” doesn’t mean “strong” here!