I agree that CoT faithfulness isn’t something we should assume occurs by default (although it seems like it does to some extent). My claim is that CoT faithfulness is a tractable problem and that people have already made meaningful steps toward guaranteeing faithfulness.
Happy to discuss this further, but have you read e.g. this post? https://www.lesswrong.com/posts/HQyWGE2BummDCc2Cx/the-case-for-cot-unfaithfulness-is-overstated