we do not understand the principles of mechanics, thermodynamics or chemistry required to build an engine.
If this is true, then it makes (mechanistic) interpretability much harder as well, since we'd need our interpretability tools to somehow teach us these underlying principles, as you go on to say. However, I don't think this is the primary stated motivation for mechanistic interpretability. The main stated motivation seems to be roughly: "We can figure out whether the model is doing bad (e.g. deceptive) stuff, and then do one or more of: 1) not deploy it, 2) not build systems that way, or 3) train against our operationalization of deception."