I very much agree. Do we really think we’re going to track every thought of a human-level AGI, let alone a superintelligence, and do it in ways it can’t dodge if it decides to?
I strongly support mech interp as a lie detector, and it would be nice to have more of it, as long as we don’t use interp and control methods to replace actual alignment work and careful thinking. The amount of effort going into interp seems a bit strange to me relative to its theory of impact.