I think I mostly agree, but am going to clarify a little bit:
Similar to the above, but I can’t help feeling that I can never truly be “at ease” with AIs unless we can understand them at a level deeper than what you’ve sketched out above.
I also believe this (I was previously interested in formal verification). I’ve just kind of given up on this ever being achieved haha. It feels like we would need to totally revamp neural nets somehow to imbue them with formal verification properties. Fwiw this was also the prevailing sentiment at the workshop on formal verification at ICML 2023.
I have a vague hypothesis about what a better paradigm for mech interp could be. It’s probably wrong, but at least I should try to look into it more.
That’s pretty neat! And I broadly agree that this is what’s going on. The problem (as I see it) is that it doesn’t have any predictive power, since it’s near-impossible to identify which specific heuristic the model is using (there can always be a slightly more complex, more general heuristic of which your chosen heuristic is a special case).
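To make the identifiability problem concrete, here’s a toy sketch (my own illustration, not from this thread): any finite set of behavioral probes is consistent with both a simple heuristic and a slightly more general one that special-cases inputs you never tested.

```python
# Two candidate "heuristics" that agree on every probe we ran.
# "simple" is the heuristic we hypothesize the model uses; "general" is a
# slightly more complex heuristic of which "simple" is a special case.
def simple(n: int) -> bool:
    return n % 2 == 0  # hypothesis: "the model detects even numbers"

def general(n: int) -> bool:
    return n % 2 == 0 or n == 10**9 + 1  # same, plus one untested exception

probes = range(10_000)
# Behaviorally indistinguishable on everything we checked:
print(all(simple(n) == general(n) for n in probes))  # True
```

No amount of behavioral testing on `probes` can tell these two apart, which is the sense in which picking out “the” heuristic lacks predictive power.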
I’m also bullish on automated interpretability conditional on some more theoretical advances.
I sincerely want auto-interp people to keep doing what they’re doing! It feels like they probably already have significant momentum in this direction anyway, and probably some people should be working on this. (But I argue that if you aren’t currently heavily invested in pushing mech interp, then you probably shouldn’t invest more.)
Mostly just agreeing with Daniel here, but as a slightly different way of expressing it: given how uncertain we are about what (if anything) will successfully decrease catastrophic & existential risk, I think it makes sense to take a portfolio approach. Mech interp, especially scalable / auto-interp mech interp, seems like a really important part of that portfolio.
Thanks for the kind words!
Thanks again! You too :)
I’m putting some of my faith in low-rank decompositions of bilinear MLPs but I’ll let you know if I make any real progress with it :)
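For anyone unfamiliar with the bilinear-MLP idea, here’s a minimal NumPy sketch of the general setup (my own illustration of why low-rank decompositions are natural here, not the commenter’s actual method): each output of a bilinear layer y = (Wx) ⊙ (Vx) is a quadratic form in x, and eigendecomposing the symmetrized interaction matrix gives an exact low-rank description.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 6, 4
W = rng.normal(size=(d_out, d_in))
V = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)

# Bilinear MLP layer (no elementwise nonlinearity): y = (W x) * (V x)
y = (W @ x) * (V @ x)

# Each output j is a quadratic form x^T B_j x, where B_j = outer(W[j], V[j]).
# Only the symmetric part of B_j matters, and it has rank <= 2.
for j in range(d_out):
    B_sym = 0.5 * (np.outer(W[j], V[j]) + np.outer(V[j], W[j]))
    assert np.isclose(x @ B_sym @ x, y[j])
    # Eigendecomposition of B_sym yields an exact low-rank description:
    # y_j = sum_k lambda_k (u_k . x)^2, with at most two nonzero lambda_k.
    eigvals, eigvecs = np.linalg.eigh(B_sym)
    y_rec = sum(lam * (eigvecs[:, k] @ x) ** 2 for k, lam in enumerate(eigvals))
    assert np.isclose(y_rec, y[j])
```

The appeal is that the eigenvectors give interpretable input directions per output neuron without needing to sample activations, since the whole layer is captured by the weights alone.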