CoT optimised to be useful in producing the correct answer is a very different object to CoT optimised to look good to a human, and a priori I expect the former to be much more likely to be faithful. Especially when thousands of tokens are spent searching for the key idea that solves a task.
For example, I have a hard time imagining how the examples in the blog post could be not faithful (not that I think faithfulness is guaranteed in general).
If they’re avoiding doing RL based on the CoT contents,
Note they didn’t say this. They said the CoT is not optimised for ‘policy compliance or user preferences’. Pretty sure what they mean is they didn’t train the model not to say naughty things in the CoT.
‘We also do not want to make an unaligned chain of thought directly visible to users.’ Why?
I think you might be overthinking this. The CoT has not been optimised not to say naughty things. OpenAI avoid putting out models that haven’t been optimised not to say naughty things. The choice was between doing the optimising, or hiding the CoT.
Edit: not wanting other people to finetune on the CoT traces is also a good explanation.
If they’re avoiding doing RL based on the CoT contents,
Note they didn’t say this. They said the CoT is not optimised for ‘policy compliance or user preferences’. Pretty sure what they mean is they didn’t train the model not to say naughty things in the CoT.
Fair point. I had imagined that there wouldn’t be RL directly on CoT other than that, but on reflection that’s false if they were using Ilya Sutskever’s process supervision approach as was rumored.
CoT optimised to be useful in producing the correct answer is a very different object to CoT optimised to look good to a human, and a priori I expect the former to be much more likely to be faithful.
Agreed!
I think you might be overthinking this. The CoT has not been optimised not to say naughty things. OpenAI avoid putting out models that haven’t been optimised not to say naughty things. The choice was between doing the optimising, or hiding the CoT.
Maybe. But it’s not clear to me that in practice it would say naughty things, since it’s easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two. Annoyingly, in the system card, they give a figure for how often the CoT summary contains inappropriate content (0.06% of the time) but not for how often the CoT itself does. What seems most interesting to me is that if the CoT did contain inappropriate content significantly more often, that would suggest that there’s benefit to accuracy if the model can think in an uncensored way.
And even if it does, then sure, they might choose not to allow CoT display (to avoid PR like ‘The model didn’t say anything naughty but it was thinking naughty thoughts’), but it seems like they could have avoided that much more cheaply by just applying an inappropriate-content filter for the CoT content and filtering it out or summarizing it (without penalizing the model) if that filter triggers.
it’s easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two
The model producing the hidden CoT and the model producing the visible-to-users summary and output might be different models/different late-layer heads/different mixtures of experts.
Oh, that’s an interesting thought, I hadn’t considered that. Different models seems like it would complicate the training process considerably. But different heads/MoE seems like it might be a good strategy that would naturally emerge during training. Great point, thanks.
But it’s not clear to me that in practice it would say naughty things, since it’s easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two.
If I think about asking the model a question about a politically sensitive or taboo subject, I can imagine it being useful for the model to say taboo or insensitive things in its hidden CoT in the course of composing its diplomatic answer. The way they trained it may or may not incentivise using the CoT to think about the naughtiness of its final answer.
But yeah, I guess an inappropriate content filter could handle that, letting us see the full CoT for maths questions and hiding it for sensitive political ones. I think that does update me more towards thinking they’re hiding it for other reasons.
I’m not so sure that an inappropriate content filter would have the desired PR effect. I think you’d need something a bit more complicated… like a filter which triggers a mechanism to produce a sanitized CoT. Otherwise the conspicuous absence of CoT on certain types of questions would make a clear pattern that would draw attention and potentially negative press and negative user feedback. That intermittently missing CoT would feel like a frustrating sort of censorship to some users.
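To make the mechanism concrete, here is a minimal sketch of the kind of display logic being discussed: show the full CoT when a content filter doesn’t trigger, and substitute a sanitized placeholder (rather than conspicuously omitting it) when it does. The function name and the `flagged` input are hypothetical; the real system’s interfaces are unknown.

```python
def display_cot(cot: str, flagged: bool) -> str:
    """Decide what to show the user for a chain of thought (CoT).

    `flagged` stands in for a hypothetical inappropriate-content
    classifier's verdict on the CoT; the classifier itself is not
    modeled here.
    """
    if not flagged:
        # e.g. a maths question: no sensitive content, show the full CoT
        return cot
    # Sensitive content: return a sanitized placeholder instead of
    # silently dropping the CoT, so there is no telltale gap in the UI.
    return "[reasoning summary withheld: sensitive content]"
```

The point of returning a placeholder on the flagged branch, rather than nothing, is exactly the PR concern above: an intermittently missing CoT would form a noticeable pattern, whereas a consistent sanitized substitute would not.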