IMO a big part of why mechanistic interp is getting a lot of attention in the x-risk community is that neural networks have turned out to be more interpretable than we might have naively expected, and there’s a lot of shovel-ready work in this area. I think if you had asked many people three years ago, they would’ve said that we’d never find a non-trivial circuit in GPT-2-small, a 125M-parameter model; yet Redwood has reverse-engineered the IOI (indirect object identification) circuit in GPT-2-small. Many people were also surprised by Neel Nanda’s modular addition work.
I don’t think I’ve seen many people be surprised here, and indeed, at least in my model of the world, interpretability is progressing slower than I was hoping for/expecting. When I saw the work by Chris Olah 6 years ago, I had hoped we would make real progress understanding how these systems think, and that lots of people would end up being able to contribute productively to the field. But our understanding has IMO barely kept up with the changing architectures of the field, is extremely far from being able to say much of anything definite about how these models do any significant fraction of what they do, and very few people outside of Chris Olah’s team seem to have made useful progress.
I would be interested if you could dig up any predictions by people expecting much slower progress on interpretability. I don’t currently believe that many people are surprised by the current tractability of the space (I do think there is a trend for people working on interpretability to feel excited by their early work, but the incentives here are too strong for me to straightforwardly take someone’s word for it, though it’s still evidence).
I have seen one person be surprised (I think twice in the same convo) by what progress had been made.
ETA: Our observations are compatible. It could be that people used to a poor and slow-moving state of interpretability are surprised by the recent uptick, but that the absolute progress over 6 years is still disappointing.