Further sketching out what ‘prosaic interpretability’ could look like.
Note: in the parlance below, ‘model’ refers to a theory / idea / intuition, not an LLM.
In my current vision, ‘prosaic interpretability’ will stitch together a swathe of different empirical results into a broader fabric of ‘LLM behavioural science’.
This will in turn yield gearsy models like Jan Kulveit’s three-layer model of LLM behavioural science or the Waluigi effect. Elucidating and refining these ‘gearsy models’ is the explicit goal of prosaic interpretability.
My current view is that most people who do empirical alignment research (like Ethan Perez’s team) already have a collection of such intuitions that drive experiment design and hypothesis formulation, but mostly don’t publish these (e.g. because they are less defensible), and prefer to focus on publishing object-level findings.
The goal is to be “nimble”: constantly adapt and update. This seems like the ~only way to keep pace with the quickly-evolving reality of frontier AI systems.
Empiricism will be a key tenet of this kind of work. Shamelessly steal and adopt any model that makes good predictions, ruthlessly discard any model that doesn’t. Systematized winning, applied at scale to the fundamental questions in interpretability.
Extensive theorising should be considered suspect by default. Complex models are more likely to be wrong. Also, developing extensive theory seems unnecessary if / when everything can change in the span of months. Ultimately, the history of science shows that the best models are simple ones.
This sounds like a plausible story for how (successful) prosaic interpretability can help us in the short to medium term! I would say, though, that more applied mech interp work could supplement prosaic interpretability’s theories. For example, the reversal curse seems mostly explained by what little we know about how neural networks do factual recall. Theory on computation in superposition helps explain why linear probes can recover arbitrary XORs of features.
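As a toy illustration of the XOR point (a sketch under stated assumptions, not the actual theory of computation in superposition): if two binary features are stored purely linearly, no linear probe can read off their XOR, but after a single layer of random ReLU units the XOR becomes linearly readable. The random weights, the layer width `d`, and the helper `probe_accuracy` are all hypothetical choices made for this example.

```python
import numpy as np

def probe_accuracy(reps, labels):
    """Fit a least-squares linear probe (with bias) and return its accuracy on the same data."""
    X = np.hstack([reps, np.ones((reps.shape[0], 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)
    preds = (X @ w) > 0.5
    return float(np.mean(preds == (labels > 0.5)))

rng = np.random.default_rng(0)
n, d = 2000, 64  # number of samples, representation width (assumed values)

# Two independent binary features and their XOR.
a = rng.integers(0, 2, size=n)
b = rng.integers(0, 2, size=n)
xor = (a ^ b).astype(float)
feats = np.stack([a, b], axis=1).astype(float)

# Case 1: features embedded purely linearly into d dimensions.
linear_reps = feats @ rng.normal(size=(2, d))

# Case 2: features passed through one random ReLU layer.
W = rng.normal(size=(2, d))
bias = rng.normal(size=d)
relu_reps = np.maximum(feats @ W + bias, 0.0)

acc_linear = probe_accuracy(linear_reps, xor)
acc_relu = probe_accuracy(relu_reps, xor)
print(f"linear embedding: {acc_linear:.2f}, random ReLU layer: {acc_relu:.2f}")
```

XOR is not linearly separable in the inputs, so the probe on the linear embedding cannot classify all four input combinations correctly, whereas the nonlinear layer produces directions along which XOR can be read off. Whether real models represent XORs for this reason is exactly the kind of question the theory is meant to address; the sketch only shows that the phenomenon is possible.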
Reading through your post gave me a chance to reflect on why I am currently interested in mech interp. Here are a few points where I think we differ:
I am really excited about fundamental, low-level questions. If they let me, I’d want to do interpretability on every single forward pass and learn the mechanism of every token prediction.
Similar to above, but I can’t help but feel like I can’t ever truly be “at ease” with AIs unless we can understand them at a level deeper than what you’ve sketched out above.
I have a vague hypothesis about what a better paradigm for mech interp could be. It’s probably wrong, but at least I should try to look into it more.
I’m also bullish on automated interpretability conditional on some more theoretical advances.
Best of luck with your future research!
Thanks for the kind words!
I think I mostly agree, but am going to clarify a little bit:
I also believe this (I was previously interested in formal verification). I’ve just kind of given up on this ever being achieved haha. It feels like we would need to totally revamp neural nets somehow to imbue them with formal verification properties. Fwiw this was also the prevailing sentiment at the workshop on formal verification at ICML 2023.
That’s pretty neat! And I broadly agree that this is what’s going on. The problem (as I see it) is that it doesn’t have any predictive power, since it’s near-impossible to identify which specific heuristic the model is using (there can always be a slightly more complex, more general heuristic of which your chosen heuristic is a special case).
I sincerely want auto-interp people to keep doing what they’re doing! It feels like they probably already have significant momentum in this direction anyway, and probably some people should be working on this. (But I argue that if you aren’t currently heavily invested in pushing mech interp then you probably shouldn’t invest more.)
Thanks again! You too :)
Mostly just agreeing with Daniel here, but as a slightly different way of expressing it: given how uncertain we are about what (if anything) will successfully decrease catastrophic & existential risk, I think it makes sense to take a portfolio approach. Mech interp, especially scalable / auto-interp mech interp, seems like a really important part of that portfolio.
I’m putting some of my faith in low-rank decompositions of bilinear MLPs, but I’ll let you know if I make any real progress with it :)