I was puzzled by that latter section (my thoughts in shortform here). Buck suggests that it may be mostly a smokescreen around ‘We don’t want to show the CoT because competitors would fine-tune on it’.
Maybe they taught the model to reason about safety rules but not to obey them in the chain of thought itself.
That’s my guess (without any inside information): the model knows the safety rules and can think-out-loud about them in the CoT (just as it can think about anything else) but they’re not fine-tuning on CoT content for ‘policy compliance or user preferences’.
I was puzzled by that latter section (my thoughts in shortform here). Buck suggests that it may be mostly a smokescreen around ‘We don’t want to show the CoT because competitors would fine-tune on it’.
That’s my guess (without any inside information): the model knows the safety rules and can think-out-loud about them in the CoT (just as it can think about anything else) but they’re not fine-tuning on CoT content for ‘policy compliance or user preferences’.
This is also how I interpreted it.