My favorite detail is that they aren’t training the hidden chain of thought to obey their content policy. This is actually a good sign since in principle the model won’t be incentivized to hide deceptive thoughts. Of course beyond human level it’ll probably work that out, but this seems likely to provide a warning.
I’m actually pretty confused about what they did exactly. From the Safety section of Learning to Reason with LLMs:
Chain of thought reasoning provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles. By teaching the model our safety rules and how to reason about them in context, we found evidence of reasoning capability directly benefiting model robustness: o1-preview achieved substantially improved performance on key jailbreak evaluations and our hardest internal benchmarks for evaluating our model’s safety refusal boundaries. We believe that using a chain of thought offers significant advances for safety and alignment because (1) it enables us to observe the model thinking in a legible way, and (2) the model reasoning about safety rules is more robust to out-of-distribution scenarios.
And from Hiding the Chains of Thought:
For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.
These two sections seem to contradict each other, but I can also think of ways to interpret them consistently. (Maybe “we cannot train any policy compliance or user preferences onto the chain of thought” is a potential future plan, not what they already did. Maybe they taught the model to reason about safety rules but not to obey them in the chain of thought itself.)
Does anyone know more details about this, and also about the reinforcement learning that was used to train o1 (what did they use as a reward signal, etc.)? I’m interested to understand how alignment in practice differs from theory (e.g. IDA), or, if OpenAI came up with a different theory, what its current alignment theory is.
I was puzzled by that latter section (my thoughts in shortform here). Buck suggests that it may be mostly a smokescreen around ‘We don’t want to show the CoT because competitors would fine-tune on it’.
Maybe they taught the model to reason about safety rules but not to obey them in the chain of thought itself.
That’s my guess (without any inside information): the model knows the safety rules and can think out loud about them in the CoT (just as it can think about anything else), but they’re not fine-tuning on CoT content for ‘policy compliance or user preferences’.
I suspect the reason for hiding the chain of thought is some blend of:
a) the model might swear or otherwise do a bad thing, hold a discussion with itself, and then decide it shouldn’t have done that bad thing; they’re more confident that they can keep the bad thing out of the summary than that they can backtrack and figure out exactly which part of the CoT needs to be redacted, and
b) they don’t want other people (especially open-source fine-tuners) to be able to fine-tune on their CoT and distill their very-expensive-to-train reasoning traces.
I will be interested to see how fast jailbreakers make headway on exposing either a) or b).
This is also how I interpreted it: the model can reason about the safety rules in its CoT, but the CoT itself isn’t trained for policy compliance.