Some excerpts from my interview with Neel Nanda about how to productively carry out research in mechanistic interpretability.
Posting this here since I believe his advice is relevant for building accurate world models in general.
An Informal Definition Of Mechanistic Interpretability
It’s kind of this weird flavor of AI interpretability that says, “Bold hypothesis. Despite the entire edifice of established wisdom in machine learning saying that these models are bullshit, inscrutable black boxes, I’m going to assume there is some actual structure here. But the structure is not there because the model wants to be interpretable or because it wants to be nice to me. The structure is there because the model learns an algorithm, and the algorithms that are most natural to express in the model’s structure and its particular architecture and stack of linear algebra are algorithms that make sense to humans.” (context)
Three Modes Of Mechanistic Interpretability Research: Confirming, Red Teaming And Gaining Surface Area
I kind of feel a lot of my research style is dominated by this deep-seated conviction that models are comprehensible, that everything is fundamentally kind of obvious, and that I should be able to just go inside the model and there should be this internal structure. And so one mode of research is I just have all of these hypotheses and guesses about what’s going on. I generate experiment ideas for things that should be true if my hypothesis is true. And I just repeatedly try to confirm it.
Another mode of research is trying to red team and break things, where I have this hypothesis, I do this experiment, I’m like, “oh my God, this is going so well”, and then I get kind of stressed because I’m concerned that I’m engaging in wishful thinking, so I try to break it and falsify it and come up with experiments that would show that actually life is complicated.
A third mode of research is what I call “trying to gain surface area” where I just have a system that I’m pretty confused about. I just don’t really know where to get started. Often, I’ll just go and do things that I think will get me more information. Just go and plot stuff or follow random things I’m curious about in a fairly undirected fuzzy way. This mode of research has actually been the most productive for me. [...]
You could paraphrase them as, “Isn’t it really obvious what’s going on?”, “Oh man, am I so sure about this?” and “Fuck around and find out”. (context)
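To make the “go and plot stuff” mode concrete, here is a minimal sketch of what that can look like in practice. It is not from the interview: it assumes Neel Nanda’s TransformerLens library and GPT-2 small, and the prompt and layer choice are arbitrary placeholders.

```python
# A minimal "gain surface area" pass: load a small model, cache everything,
# and eyeball the layer-0 attention patterns without any particular hypothesis.
# Model, prompt, and layer are arbitrary illustrations.
import matplotlib.pyplot as plt
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small: cheap to poke at

prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
logits, cache = model.run_with_cache(tokens)

patterns = cache["pattern", 0]            # [batch, n_heads, query_pos, key_pos]
str_tokens = model.to_str_tokens(prompt)

fig, axes = plt.subplots(3, 4, figsize=(16, 12))  # GPT-2 small has 12 heads per layer
for head, ax in enumerate(axes.flat):
    ax.imshow(patterns[0, head].detach().cpu(), cmap="viridis")
    ax.set_title(f"L0H{head}")
    ax.set_xticks(range(len(str_tokens)))
    ax.set_xticklabels(str_tokens, rotation=90, fontsize=6)
    ax.set_yticks([])
plt.tight_layout()
plt.show()
```

The point is not the specific plot but the habit: load a model, cache everything, and stare at whatever seems informative before committing to a hypothesis.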
Strong Beliefs Weakly Held: Having Hypotheses But Being Willing To Be Surprised
You can kind of think of it as “strong beliefs weakly held”. I think you should be good enough that you can start to form hypotheses. Being at the point where you can sit down, set a five-minute timer, and brainstorm what’s going on and come up with four different hypotheses is just a much, much stronger research position than sitting down to brainstorm and coming up with nothing. Yeah, maybe having two hypotheses is best. You want to have multiple hypotheses in mind.
You also want to be aware that probably both of them are wrong, but you want to have enough engagement with the problem that you can generate experiment ideas. Maybe one way to phrase it is if you don’t have any idea what’s going on, it’s hard to notice what’s surprising. And often noticing what’s surprising is one of the most productive things you can do when doing research. (context)
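As a concrete, hypothetical example of an experiment that can confirm, falsify, or surprise: one common experiment shape in mechanistic interpretability is to ablate a component you believe matters and check whether the loss degrades the way your hypothesis predicts. The sketch below again assumes TransformerLens; the specific layer, head, and prompt are placeholders, not anything Nanda mentions.

```python
# Quick confirm-or-falsify experiment: if head (LAYER, HEAD) really implements the
# behaviour I hypothesise, zero-ablating it should hurt the loss on prompts that
# exercise that behaviour. The head indices and prompt are placeholders.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("When Mary and John went to the store, John gave a drink to Mary")

LAYER, HEAD = 9, 9  # hypothesised-to-matter head; purely illustrative

def zero_head(z, hook):
    # z: [batch, pos, head_index, d_head] -- knock out one head's output
    z[:, :, HEAD, :] = 0.0
    return z

clean_loss = model(tokens, return_type="loss")
ablated_loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[(utils.get_act_name("z", LAYER), zero_head)],
)
print(f"clean loss: {clean_loss.item():.3f}   ablated loss: {ablated_loss.item():.3f}")
# A big gap is weak evidence for the hypothesis; no gap at all is exactly the
# kind of surprise worth noticing.
```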
On The Benefits Of The Experimental Approach
I think there is a strong trend among people, especially the kind of people who get drawn to alignment from very theory-based arguments, to go and just pure theorycraft and play around with toy models and form beautiful, elegant hypotheses about what happens in real models. [...] And there’s a kind of person who will write really detailed research proposals involving toy models that never have the step of like “go and make sure that this is actually what’s happening in the real language models we care about”. And I just think this is a really crucial mistake that people often make. Real models are messy and ugly and cursed, so I vibe, but you also can’t just ignore the messy, complicated thing that’s ultimately the one we want to understand.
The second thing is that, I don’t know, mechanistic interpretability seems hard and messy, but it’s kind of embarrassing how little we’ve tried. And it would just be so embarrassing if we make AGI and it kills everyone and we could have interpreted it, we just didn’t try hard enough and didn’t know enough to get to the point where we could look inside it and see the ‘press here to kill everyone’ button. (context)
The learned algorithms will not always be simple enough to be interpretable, but I agree we should try to interpret as much as we can. What we are trying to predict is the behavior of future, more powerful models. I think toy models can sometimes have characteristics that are absent from current language models, but those characteristics may be integrated into, or emerge from, the more advanced systems that we build.