Could mech interp ever be as good as chain of thought?
Suppose there are 10 years of monumental progress in mechanistic interpretability. We can roughly, though not exactly, explain any output that comes out of a neural network. We can run experiments where we put our AIs in interesting situations and make a very good guess at their hidden reasons for doing the things they do.
Doesn’t this sound a bit like where we currently are with models that operate with a hidden chain of thought? If you don’t think an AGI built under the current fingers-crossed-it’s-faithful paradigm would be safe, what percentage of outcomes would mech interp have to hit to beat that?
Seems like 99+% to me.
I very much agree. Do we really think we’re going to track a human-level AGI’s, let alone a superintelligence’s, every thought, and do it in ways it can’t dodge if it decides to?
I strongly support mech interp as a lie detector, and it would be nice to have more of it, as long as we don’t use it and control methods as a substitute for actual alignment work and careful thinking. The amount of effort going into interp relative to its theory of impact seems a bit strange to me.
A similar point is made here (towards the end): https://www.lesswrong.com/posts/HQyWGE2BummDCc2Cx/the-case-for-cot-unfaithfulness-is-overstated
I think I agree, more or less. One caveat is that I expect RL fine-tuning to degrade the signal / faithfulness / what-have-you in the chain of thought, whereas the same is likely not true of mech interp.