[A]fter weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.
I hesitate to draw any conclusions from the o1 CoT summary since it’s passed through a summarizing model.
I agree that it isn't as strong evidence as having access to the original CoT, but I think deviations in the CoT are more likely than the summarizer fumbling that hard.
There are now two alleged instances of full chains of thought leaking (apply an appropriate amount of skepticism), both of which seem coherent enough.
I think it’s more likely that this is just a (non-model) bug in ChatGPT. In the examples you gave, there’s always one step that comes completely out of nowhere, and the rest of the chain of thought would make sense without it. This reminds me of the bug where ChatGPT would show other users’ conversations.