Thoughts on a passage from OpenAI’s GPT-o1 post today:
We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to “read the mind” of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.
Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.
This section is interesting in a few ways:
‘Assuming it is faithful and legible’—we have reason to believe that it’s not, at least not on previous models, as they surely know. Do they have reason to believe that it is for o1, or are they just ignoring that issue?
‘we cannot train any policy compliance or user preferences onto the chain of thought’—sure, legit. Although LLM experiments that use a “hidden” CoT or scratchpad may already show up enough times in the training data that I expect LLMs trained on the internet to understand that the scratchpads aren’t really hidden. If they don’t yet, I expect they will soon.
‘We also do not want to make an unaligned chain of thought directly visible to users.’ Why?
I guess I can see a story here, something like, ‘The user has asked the model how to build a bomb, and even though the model is prosaically-aligned, it might put instructions for building a bomb in the CoT before deciding not to share them.’ But that raises questions.
Is this behavior they’re actually seeing in the model? It’s not obvious to me that you’d expect it to happen. If they’re avoiding doing RL based on the CoT contents, then certainly it could happen, but it seems like it would be strictly more complex behavior, and so not very likely to spontaneously emerge.
Although I can also imagine a story where there’d be pressure for it to emerge. Can the model reason more clearly if it has the opportunity to think uncensored thoughts?
But also: ‘for the o1 model series we show a model-generated summary of the chain of thought.’ It seems strange to spend many more forward passes summarizing the CoT rather than just doing a single pass through a model trained to detect policy-violating content and omitting the CoT if that triggers.
In addition to the previous justifications, they cite ‘user experience’ and ‘competitive advantage’. The former seems silly at first blush; how will users’ experience be negatively affected by a CoT that’s hidden by default and that they never need to look at? I’m curious about what sort of ‘competitive advantage’ they’re talking about. Maybe the CoT would reveal a highly-structured system prompt for how to do CoT that accounts for a lot of the decreased loss on reasoning tasks?
Copying a comment on this from @Buck elsewhere that seems pretty plausible:
CoT optimised to be useful in producing the correct answer is a very different object to CoT optimised to look good to a human, and a priori I expect the former to be much more likely to be faithful. Especially when thousands of tokens are spent searching for the key idea that solves a task.
For example, I have a hard time imagining how the examples in the blog post could be not faithful (not that I think faithfulness is guaranteed in general).
If they’re avoiding doing RL based on the CoT contents,
Note they didn’t say this. They said the CoT is not optimised for ‘policy compliance or user preferences’. Pretty sure what they mean is they didn’t train the model not to say naughty things in the CoT.
‘We also do not want to make an unaligned chain of thought directly visible to users.’ Why?
I think you might be overthinking this. The CoT has not been optimised not to say naughty things. OpenAI avoid putting out models that haven’t been optimised not to say naughty things. The choice was between doing the optimising, or hiding the CoT.
Edit: not wanting other people to finetune on the CoT traces is also a good explanation.
If they’re avoiding doing RL based on the CoT contents,
Note they didn’t say this. They said the CoT is not optimised for ‘policy compliance or user preferences’. Pretty sure what they mean is they didn’t train the model not to say naughty things in the CoT.
Fair point. I had imagined that there wouldn’t be RL directly on CoT other than that, but on reflection that’s false if they were using Ilya Sutskever’s process supervision approach as was rumored.
CoT optimised to be useful in producing the correct answer is a very different object to CoT optimised to look good to a human, and a priori I expect the former to be much more likely to be faithful.
Agreed!
I think you might be overthinking this. The CoT has not been optimised not to say naughty things. OpenAI avoid putting out models that haven’t been optimised not to say naughty things. The choice was between doing the optimising, or hiding the CoT.
Maybe. But it’s not clear to me that in practice it would say naughty things, since it’s easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two. Annoyingly, in the system card, they give a figure for how often the CoT summary contains inappropriate content (0.06% of the time) but not for how often the CoT itself does. What seems most interesting to me is that if the CoT did contain inappropriate content significantly more often, that would suggest that there’s benefit to accuracy if the model can think in an uncensored way.
And even if it does, then sure, they might choose not to allow CoT display (to avoid PR like ‘The model didn’t say anything naughty but it was thinking naughty thoughts’), but it seems like they could have avoided that much more cheaply by just applying an inappropriate-content filter for the CoT content and filtering it out or summarizing it (without penalizing the model) if that filter triggers.
it’s easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two
The model producing the hidden CoT and the model producing the visible-to-users summary and output might be different models/different late-layer heads/different mixtures of experts.
Oh, that’s an interesting thought, I hadn’t considered that. Different models seems like it would complicate the training process considerably. But different heads/MoE seems like it might be a good strategy that would naturally emerge during training. Great point, thanks.
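For concreteness, here’s a toy version of the ‘different late-layer heads’ idea: one shared trunk with separate output heads for the hidden CoT stream and the user-visible stream. This is purely a hypothetical sketch; nothing in the o1 release confirms any such design, and all names here are made up.

```python
import torch
import torch.nn as nn

class TwoHeadLM(nn.Module):
    """Toy sketch: shared trunk, separate heads for hidden CoT vs. visible output.
    Hypothetical illustration only -- not a claim about o1's architecture."""

    def __init__(self, vocab_size: int = 50_000, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        self.cot_head = nn.Linear(d_model, vocab_size)      # hidden chain of thought
        self.visible_head = nn.Linear(d_model, vocab_size)  # user-facing summary/answer

    def forward(self, tokens: torch.Tensor):
        h = self.trunk(self.embed(tokens))
        return self.cot_head(h), self.visible_head(h)
```

The point is just that the two streams could share almost all parameters while being trained against different objectives, which is why this could plausibly emerge without a separate model.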
But it’s not clear to me that in practice it would say naughty things, since it’s easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two.
If I think about asking the model a question about a politically sensitive or taboo subject, I can imagine it being useful for the model to say taboo or insensitive things in its hidden CoT in the course of composing its diplomatic answer. The way they trained it may or may not incentivise using the CoT to think about the naughtiness of its final answer.
But yeah, I guess an inappropriate content filter could handle that, letting us see the full CoT for maths questions and hiding it for sensitive political ones. I think that does update me more towards thinking they’re hiding it for other reasons.
I’m not so sure that an inappropriate content filter would have the desired PR effect. I think you’d need something a bit more complicated… like a filter which triggers a mechanism to produce a sanitized CoT. Otherwise the conspicuous absence of CoT on certain types of questions would make a clear pattern that would draw attention and potentially negative press and negative user feedback. That intermittently missing CoT would feel like a frustrating sort of censorship to some users.
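For what it’s worth, the pipeline being debated here is cheap to sketch. The code below assumes a hypothetical `violates_policy` classifier and `summarize_sanitized` helper; neither is a real OpenAI component, it’s just to make the ‘filter, then sanitize rather than omit’ idea concrete.

```python
from typing import Callable

def present_cot(cot: str,
                violates_policy: Callable[[str], bool],
                summarize_sanitized: Callable[[str], str]) -> str:
    """Post-hoc display policy for a hidden chain of thought.

    Nothing here touches training: the CoT itself stays unconstrained, and a
    separate classifier only decides what the user gets to see.
    """
    if violates_policy(cot):
        # Return a sanitized summary instead of omitting the CoT outright, so a
        # conspicuously missing CoT doesn't itself become a signal to the user.
        return summarize_sanitized(cot)
    return cot  # benign CoT (e.g. a maths derivation) can be shown verbatim

# Trivial stand-ins, just to show the call shape:
flagged = lambda text: "bomb" in text.lower()
sanitize = lambda text: "[reasoning summary; some details withheld]"
print(present_cot("First, expand the binomial...", flagged, sanitize))
```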
My understanding is something like:
OpenAI RL fine-tuned these language models against process reward models rather than outcome supervision. However, process supervision is much easier for objective tasks such as STEM question answering, so the process reward model is underspecified for other (out-of-distribution) domains. It’s unclear how much RL fine-tuning is performed against these underspecified reward models for OOD domains. In any case, when CoTs are sampled from these language models in OOD domains, misgeneralization is expected. I don’t know how easily this is fixable with standard RLHF / outcome reward models (although I don’t expect it to be too difficult), but it seems like instead of fixing it they have gone the route of: we’ll keep it unconstrained and monitor it. (Of course, there may be other reasons as well, such as to prevent others from fine-tuning on their CoTs.)
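To make that distinction concrete, here is a toy contrast between the two supervision styles. The `answer_scorer` and `step_scorer` functions are hypothetical stand-ins for learned reward models; this is not a description of OpenAI’s actual training setup.

```python
from typing import Callable, List

def outcome_reward(final_answer: str,
                   answer_scorer: Callable[[str], float]) -> float:
    """Outcome supervision: one scalar reward for the final answer only."""
    return answer_scorer(final_answer)

def process_reward(cot_steps: List[str],
                   step_scorer: Callable[[str], float]) -> float:
    """Process supervision: score every intermediate reasoning step.

    For subjective or out-of-distribution prompts the step scorer is
    underspecified, which is exactly the misgeneralization worry above.
    """
    if not cot_steps:
        return 0.0
    return sum(step_scorer(step) for step in cot_steps) / len(cot_steps)
```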
I’m a little concerned that even if they find very problematic behaviour, they will blame it on clearly expected misgeneralization and therefore no significant steps will be taken, especially because there is no reputational damage (This concern is conditional on the assumption of very little outcomes-based supervision and mostly only process supervision on STEM tasks).
Do you happen to have evidence that they used process supervision? I’ve definitely heard that rumored, but haven’t seen it confirmed anywhere that I can recall.
when COTs are sampled from these language models in OOD domains, misgeneralization is expected.
Offhand, it seems like if they didn’t manage to train that out, it would result in pretty bad behavior on non-STEM benchmarks, because (I’m guessing) output conditional on a bad CoT would be worse than output with no CoT at all. They showed less improvement on non-STEM benchmarks than on math/coding/reasoning, but certainly not a net drop. Thoughts? I’m not confident in my guess there.
I’m a little concerned that even if they find very problematic behaviour, they will blame it on clearly expected misgeneralization and therefore no significant steps will be taken, especially because there is no reputational damage
That’s a really good point. As long as benchmark scores are going up, there’s not a lot of incentive to care about whatever random stuff the model says, especially if the CoT is sometimes illegible anyway. Now I’m really curious about whether red-teamers got access to the unfiltered CoT at all.
I guess I can see a story here, something like, ‘The user has asked the model how to build a bomb, and even though the model is prosaically-aligned, it might put instructions for building a bomb in the CoT before deciding not to share them.’
I actually think this is non-trivially likely, because there’s a pretty large gap between aligning an AI / making an AI corrigible to users and making an AI that is misuse-resistant: the second problem is a lot harder than the first, and there’s been quite a lot less progress on it.
I agree that it’s quite plausible that the model could behave in that way, it’s just not clear either way.
I disagree with your reasoning, though, in that to whatever extent GPT-o1 is misuse-resistant, that same resistance is available to the model when doing CoT, and my default guess is that it would apply the same principles in both cases. That could certainly be wrong! It would be really helpful if the system card had given a figure for how often the CoT contained inappropriate content, rather than just how often the CoT summary did.
Elsewhere, @Wei Dai points out the apparent conflict between ‘we cannot train any policy compliance or user preferences onto the chain of thought’ (above) and the following from the Safety section (emphasis mine):
We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles. By teaching the model our safety rules and how to reason about them in context...