But it’s not clear to me that in practice it would say naughty things, since it’s easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two.
If I think about asking the model a question about a politically sensitive or taboo subject, I can imagine it being useful for the model to say taboo or insensitive things in its hidden CoT in the course of composing its diplomatic answer. The way they trained it may or may not incentivise using the CoT to think about the naughtiness of its final answer.
But yeah, I guess an inappropriate content filter could handle that, letting us see the full CoT for maths questions and hiding it for sensitive political ones. I think that does update me more towards thinking they’re hiding it for other reasons.
I’m not so sure that an inappropriate content filter would have the desired PR effect. I think you’d need something a bit more complicated… like a filter which triggers a mechanism to produce a sanitized CoT. Otherwise the conspicuous absence of CoT on certain types of questions would form a clear pattern, one that would draw attention and potentially negative press and user feedback. That intermittently missing CoT would feel like a frustrating sort of censorship to some users.