I don’t really buy this argument.
I think the following is a vague and slippery concept: “a kind of generalized reverse-engineering/interpretability skill”. But I agree that you would want to do testing, etc. of any system before you deploy it.
It seems like the ambitious goal of mechanistic interpretability, which would get you the kind of safety properties we are after, would indeed require explaining how the entire process works. But when we are talking about such a complex system, the main obstacle to understanding, for either approach, is our limited ability to comprehend such an explanation at all. I don't see a reason to think we can surmount that obstacle more easily via reverse engineering than via engineering. It often seems to me that people are assuming either that mechanistic interpretability addresses this obstacle (I'm skeptical), or that the obstacle doesn't effectively exist (in which case, why can't we just do it via engineering?).