This sounds like a plausible story for how (successful) prosaic interpretability can help us in the short to medium term! That said, I think more applied mech interp work could supplement prosaic interpretability’s theories. For example, the reversal curse seems mostly explained by what little we know about how neural networks do factual recall, and theory on computation in superposition helps explain why linear probes can recover arbitrary XORs of features.
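To make the XOR point concrete, here's a toy sketch (everything here is made up for illustration, not taken from any real model): if the hidden state happens to contain a dedicated linear direction for a⊕b, a linear probe recovers the XOR easily; if the representation only encodes a and b linearly, no linear probe can.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 2000

# Hypothetical feature directions (illustrative): one for feature a,
# one for feature b, and one dedicated direction for a XOR b.
v_a, v_b, v_xor = rng.standard_normal((3, d)) / np.sqrt(d)

a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)
xor = a ^ b
noise = 0.05 * rng.standard_normal((n, d))

def probe_accuracy(H, labels):
    """Fit a linear probe (least squares on +/-1 labels, with a bias
    column) and return its in-sample accuracy."""
    Hb = np.hstack([H, np.ones((len(H), 1))])
    w, *_ = np.linalg.lstsq(Hb, 2.0 * labels - 1.0, rcond=None)
    return ((Hb @ w > 0).astype(int) == labels).mean()

# Case 1: the representation linearly encodes a, b, AND a XOR b.
H_with = np.outer(a, v_a) + np.outer(b, v_b) + np.outer(xor, v_xor) + noise
acc_with = probe_accuracy(H_with, xor)

# Case 2: only a and b are linearly encoded; XOR is not linearly separable.
H_without = np.outer(a, v_a) + np.outer(b, v_b) + noise
acc_without = probe_accuracy(H_without, xor)

print(f"probe accuracy with XOR direction:    {acc_with:.2f}")
print(f"probe accuracy without XOR direction: {acc_without:.2f}")
```

The theory's explanatory work is in predicting *when* case 1 holds, i.e. when models actually allocate a direction to the XOR.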
Reading through your post gave me a chance to reflect on why I am currently interested in mech interp. Here are a few points where I think we differ:
I am really excited about fundamental, low-level questions. If I could, I’d do interpretability on every single forward pass and learn the mechanism behind every token prediction.
Similar to above, but I can’t help but feel like I can’t ever truly be “at ease” with AIs unless we can understand them at a level deeper than what you’ve sketched out above.
I have some vague hypothesis about what a better paradigm for mech interp could be. It’s probably wrong, but I should at least try to look into it more.
I’m also bullish on automated interpretability conditional on some more theoretical advances.
I think I mostly agree, but am going to clarify a little bit:
> Similar to above, but I can’t help but feel like I can’t ever truly be “at ease” with AIs unless we can understand them at a level deeper than what you’ve sketched out above.
I also believe this (I was previously interested in formal verification). I’ve just kind of given up on this ever being achieved haha. It feels like we would need to totally revamp neural nets somehow to imbue them with formal verification properties. Fwiw this was also the prevailing sentiment at the workshop on formal verification at ICML 2023.
> I have some vague hypothesis about what a better paradigm for mech interp could be. It’s probably wrong, but I should at least try to look into it more.
That’s pretty neat! And I broadly agree that this is what’s going on. The problem (as I see it) is that it doesn’t have any predictive power, since it’s near-impossible to identify which specific heuristic the model is using (there can always be a slightly more complex, more general heuristic of which your chosen heuristic is a special case).
> I’m also bullish on automated interpretability conditional on some more theoretical advances.
I sincerely want auto-interp people to keep doing what they’re doing! It feels like they probably already have significant momentum in this direction anyway, and probably some people should be working on this. (But I argue that if you aren’t currently heavily invested in pushing mech interp then you probably shouldn’t invest more.)
> I sincerely want auto-interp people to keep doing what they’re doing! It feels like they probably already have significant momentum in this direction anyway, and probably some people should be working on this.
Mostly just agreeing with Daniel here, but as a slightly different way of expressing it: given how uncertain we are about what (if anything) will successfully decrease catastrophic & existential risk, I think it makes sense to take a portfolio approach. Mech interp, especially scalable / auto-interp mech interp, seems like a really important part of that portfolio.
Best of luck with your future research!

Thanks for the kind words, and thanks again! You too :)

> since it’s near-impossible to identify which specific heuristic the model is using (there can always be a slightly more complex, more general heuristic of which your chosen heuristic is a special case).
I’m putting some of my faith in low-rank decompositions of bilinear MLPs, but I’ll let you know if I make any real progress with it :)
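For anyone curious what I mean: in a bilinear layer h(x) = (Wx) ⊙ (Vx), any readout direction over the hidden units is exactly a quadratic form xᵀSx in the inputs, so you can eigendecompose S and read off input directions purely from the weights. A toy numpy sketch (random weights, names made up, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 32, 128

# Toy bilinear MLP layer: h(x) = (W x) * (V x), elementwise product,
# no nonlinearity. We read the hidden units out through a direction u.
W = rng.standard_normal((d_hid, d_in)) / np.sqrt(d_in)
V = rng.standard_normal((d_hid, d_in)) / np.sqrt(d_in)
u = rng.standard_normal(d_hid)

def readout(x):
    return u @ ((W @ x) * (V @ x))

# Because the layer is bilinear, the readout equals x^T S x for a
# symmetric d_in x d_in "interaction matrix" S built from the weights alone.
B = W.T @ (u[:, None] * V)
S = (B + B.T) / 2

x = rng.standard_normal(d_in)
assert np.isclose(readout(x), x @ S @ x)

# Low-rank decomposition: eigendecompose S and keep the top-r components
# by eigenvalue magnitude; the eigenvectors are candidate input directions.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(-np.abs(eigvals))
r = 8
approx = sum(eigvals[i] * (eigvecs[:, i] @ x) ** 2 for i in order[:r])
print(f"rank-{r} approximation: {approx:.4f}, full readout: {readout(x):.4f}")
```

With random weights the spectrum isn’t actually low-rank, so the truncation here is just to show the mechanics; the hope is that trained models concentrate the interaction into a few eigenvectors.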