A similar point is made here (towards the end): https://www.lesswrong.com/posts/HQyWGE2BummDCc2Cx/the-case-for-cot-unfaithfulness-is-overstated
I think I agree, more or less. One caveat is that I expect RL fine-tuning to degrade the signal / faithfulness / what-have-you in the chain of thought, whereas mech interp is likely not degraded in the same way.