Self-Review: After a while of being insecure about it, I’m now pretty fucking proud of this paper, and think it’s one of the coolest pieces of research I’ve personally done. (I’m going to review both this post and the subsequent paper.) Though, as discussed below, I think people often overrate it.
Impact

The main impact IMO is proving that mechanistic interpretability is actually possible: that we can take a trained neural network and reverse-engineer non-trivial and unexpected algorithms from it. In particular, I think by focusing on grokking I (semi-accidentally) did a great job of picking a problem that people cared about for non-interp reasons, where mech interp was unusually easy (it was a small model on a clean algorithmic task), and where I was able to find real insights about grokking as a direct result of doing the mechanistic interpretability. Real models are fucking complicated (and even this model has some details we didn’t fully understand), but I feel great about the field having something that’s genuinely detailed, rigorous and successfully reverse-engineered, and this seems an important proof of concept. The other contributions are the specific algorithm I found and the specific insights about how and why grokking happens, but in my opinion these are much less interesting.
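For concreteness, here’s a minimal numpy sketch of the Fourier multiplication algorithm. The frequencies below are illustrative stand-ins (the trained model learns its own handful of key frequencies), but the structure is the one we found:

```python
import numpy as np

p = 113            # the modulus used in the paper
ks = [14, 35, 41]  # illustrative key frequencies; the trained model picks its own

def fourier_mult_logits(a, b):
    """Represent a and b as cos/sin waves at each key frequency, combine them
    with trig identities into waves in (a + b), then score each candidate
    answer c by cos(w * (a + b - c)), which peaks exactly at c = (a + b) % p.
    Summing over several frequencies makes the correct peak dominate."""
    cs = np.arange(p)
    logits = np.zeros(p)
    for k in ks:
        w = 2 * np.pi * k / p
        # trig identities (roughly what the trained model's MLP layer implements):
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        # readout: cos(w*(a+b-c)) = cos(w*(a+b))*cos(w*c) + sin(w*(a+b))*sin(w*c)
        logits += cos_ab * np.cos(w * cs) + sin_ab * np.sin(w * cs)
    return logits

a, b = 17, 99
assert np.argmax(fourier_mult_logits(a, b)) == (a + b) % p
```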
Field-Building

Another large chunk of the impact is that this was a major win for mech interp field-building. This is hard to measure, but some data:
There are multiple papers I like that are substantially building on/informed by these results (A toy model of universality, The clock and the pizza, Feature emergence via margin maximization, Explaining grokking through circuit efficiency).
It’s got >100 citations in less than a year (a decent chunk of these are semi-fake citations from this being used as a standard citation for ‘mech interp exists as a field’, so I care more about the “how many papers would not exist without this” metric)
It went pretty viral on Twitter (1,000 likes, 300,000 views)
This was the joint first mech interp paper at a top ML conference (ICLR 23 spotlight) and seemed to get a lot of interest at the conference.
Anecdotally, I moderately often see people discussing this paper, or meet people who know of me from it.
At least one of my MATS scholars says he got into the field because of this work.
Is field-building good?

It’s plausible to me that, on the margin, less alignment effort should be going into mech interp, and more should go into other agendas. But I’m still excited about mech interp field-building, and think it’s a solid and high-value thing. Field-building is often positive-sum: people often have strong personal fits to different areas, and many are drawn to the field from non-alignment motivations. And though there’s concern over capabilities externalities, my guess is that good interp work is strongly net good.
Is it worth publishing in academic venues?

Submitting this to ICLR was my first serious experience with peer review. I’m not sure what updates I’ve made re whether this was worth it. I think some, but probably <50%, of the field-building benefit came from this, and that going viral on Twitter was much more important for ensuring people became aware of the work. Publishing made the work more legitimate-seeming, more citable, more respectable, etc. On an object level, it resulted in the writing, ideas and experiments becoming much better and clearer (though it led to some of the more interesting speculation being relegated to appendix E :‘( ), although this was largely due to /u/lawrencec’s skills rather than to peer review itself. I definitely find peer review/conforming to academic conventions very tedious and irritating, and am very grateful to Lawrence for doing almost all of the work.
Personal benefit

It’s hard to measure, but I think this turned out to be substantially good for my career, reputation and influence. It’s often been what people know me for, and I think it has helped me snowball into other career successes.
Ways it’s overrated

As noted, I do think there are ways people overrate the results/ideas here, and that there’s too much interest in the object-level results (the Fourier multiplication algorithm, and understanding grokking). Some thoughts:
The specific techniques I used to understand the model are super specific to modular addition + Fourier stuff, and not very useful on real models (though I hold out hope that they’ll be relevant to how language models do addition!). See the first sketch after this list for the flavour.
People often think grokking is a key insight about how language models learn; I think this can be misleading. Grokking is a fragile, transitional state you get when your hyper-parameters are just right (a bit more or less data makes the model memorise or generalise immediately), and it requires a ton of overtraining on the same data again and again (see the second sketch after this list for a typical setup). I think grokking gives some hints that individual circuits may be learned in sudden phase transitions (the quantization model of neural scaling), but we need much more evidence from real models on these questions. And something like “true reasoning” is plausibly a mix of many circuits, each with their own phase transition, rather than a thing that’ll be suddenly grokked.
People often underestimate the difficulty jump in interpreting real models (even a 1L language model) compared to the modular addition model, and get too excited about how easy it’ll be.
People also get excited about more algorithmic interp work. I think this is largely played out, and I focus on real models in my own work (and the work I supervise). I ultimately care about reverse-engineering AGI, and I think language models (even small ones) are decent proxies for this, while models of algorithmic problems are not, unless you set up a really good one that captures some property of real models that we care about. And I’m unconvinced there’s much more marginal value in demonstrating that algorithmic mech interp is possible.
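On the first point, the core move was rewriting the model’s weights and activations in a Fourier basis over the inputs and noticing they were sparse there. Here’s a minimal sketch of that analysis; the random W_E is a hypothetical stand-in for the trained embedding matrix:

```python
import numpy as np

# Hypothetical stand-in: in the paper this would be the trained embedding
# matrix, with one row per input token 0..p-1.
p, d = 113, 128
W_E = np.random.randn(p, d)

# Orthonormal Fourier basis over the inputs: constant, cos(2*pi*k*x/p), sin(2*pi*k*x/p).
x = np.arange(p)
basis = [np.ones(p) / np.sqrt(p)]
for k in range(1, p // 2 + 1):
    basis.append(np.cos(2 * np.pi * k * x / p) * np.sqrt(2 / p))
    basis.append(np.sin(2 * np.pi * k * x / p) * np.sqrt(2 / p))
F = np.stack(basis)  # (p, p) change-of-basis matrix

# Rewrite the embedding in the Fourier basis and measure each component's norm.
# For the trained model this is strikingly sparse, peaking at the key
# frequencies; for this random stand-in it will be roughly flat.
component_norms = np.linalg.norm(F @ W_E, axis=1)
print(component_norms.argsort()[-10:])  # indices of the dominant components
```

The same basis change applies to the neuron activations and the unembedding, which is how the key frequencies and the trig identities show up in the first place; none of this transfers to models whose inputs lack that clean periodic structure.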
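On the second point, here’s a sketch of the kind of setup where grokking shows up. It uses a stand-in MLP rather than the paper’s 1-layer transformer, and the exact hyper-parameter values are illustrative; the key knobs are the training-data fraction, high weight decay, and a huge number of epochs on the same data (and whether you actually get grokking is sensitive to all of them, which is rather the point):

```python
import torch

torch.manual_seed(0)
p = 113
frac_train = 0.3  # the critical knob: a bit less data and the model just memorises,
                  # a bit more and it generalises quickly with no long plateau

# All (a, b) pairs, labelled with (a + b) % p, split into train/test.
a, b = torch.meshgrid(torch.arange(p), torch.arange(p), indexing="ij")
X = torch.stack([a.flatten(), b.flatten()], dim=1)
y = (X[:, 0] + X[:, 1]) % p
perm = torch.randperm(len(X))
n_train = int(frac_train * len(X))
train, test = perm[:n_train], perm[n_train:]

def one_hot(idx):
    # Concatenate one-hot encodings of both inputs.
    return torch.cat([torch.nn.functional.one_hot(X[idx, 0], p),
                      torch.nn.functional.one_hot(X[idx, 1], p)], dim=1).float()

# Stand-in model; the paper used a 1-layer transformer instead.
model = torch.nn.Sequential(torch.nn.Linear(2 * p, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, p))
# Full-batch AdamW with high weight decay, run for tens of thousands of epochs.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for epoch in range(40_000):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(one_hot(train)), y[train])
    loss.backward()
    opt.step()
    if epoch % 1000 == 0:
        with torch.no_grad():
            test_acc = (model(one_hot(test)).argmax(-1) == y[test]).float().mean()
        # In a grokking run: train accuracy saturates early, test accuracy sits
        # near chance for a long time, then jumps up much later.
        print(epoch, loss.item(), test_acc.item())
```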