we do not understand the principles of mechanics, thermodynamics or chemistry required to build an engine.
If this is true, then it makes (mechanistic) interpretability much harder as well, since we'd need our interpretability tools to somehow teach us these underlying principles, as you go on to say. However, I don't think this is the primary stated motivation for mechanistic interpretability. The main stated motivation seems to be roughly: "We can figure out whether the model is doing bad (e.g. deceptive) stuff, and then do one or more of: 1) not deploy it, 2) not build systems that way, or 3) train against our operationalization of deception."