I think chain-of-thought (CoT) faithfulness was a goal, a hope that had yet to be realized. People were assuming it was there in many cases when it wasn’t.
You can see the cracks showing in many places. For example, editing the CoT to be incorrect and observing that the model still produces the same correct answer. Or cases where the CoT was incorrect to begin with, and yet the answer was correct.
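A minimal sketch of that kind of intervention check, assuming a hypothetical `query_model` helper (a stand-in for whatever model/API you actually use) and a deliberately crude `corrupt` function; this is not any particular lab's protocol, just the shape of the experiment:

```python
def query_model(prompt: str, cot: str) -> str:
    """Hypothetical helper: return the model's final answer given a prompt
    and a chain of thought it is asked to continue from."""
    raise NotImplementedError("wire this up to your model or API of choice")


def corrupt(cot: str) -> str:
    """Introduce a deliberate error into the CoT.
    Placeholder corruption; a real experiment would swap a key step or value."""
    return cot.replace("2", "7")


def cot_sensitivity_check(prompt: str, original_cot: str) -> bool:
    """Return True if the final answer survives a corrupted CoT unchanged,
    i.e. evidence the stated reasoning isn't doing the work."""
    answer_original = query_model(prompt, original_cot)
    answer_corrupted = query_model(prompt, corrupt(original_cot))
    return answer_original == answer_corrupted
```

If the answer doesn’t budge when the stated reasoning is broken, the CoT is at best a partial account of how the answer was actually produced.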
Those “private scratchpads”? Really? How sure are you that the model was “fooled” by them? What evidence do you have for that? I think the default assumption has to be that the model sees a narrative story element that predicts a certain type of text, and so produces the kind of text it expects to belong there. That doesn’t mean you now have true insight into the model’s computational generative process!