> I don’t know of any particular reason to think this reflects anything about o1 CoTs themselves, rather than just quirks of the (probably pretty weak) summarizer model.
My reason is that I have never heard of summarizers injecting totally irrelevant content. I have seen models misunderstand papers, but I’ve never seen a model write about anime in the summary of a physics paper.
> seems likely that o1 was trained with supervision on the individual CoT steps

OpenAI directly says that they didn’t do that:
> We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to “read the mind” of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.
On a separate note, doesn’t process supervision directly train unfaithful CoT? There is no guarantee that training against illegible parts of the CoT trains away the thinking process that produced them, rather than just hiding it.
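To make this concrete, here’s a toy sketch (my own illustration, with a made-up legibility rater; nothing here is OpenAI’s actual setup) of a step-level penalty on illegible CoT. The signal only sees the emitted text, so it rewards changing what gets written, not necessarily what gets computed:

```python
# Toy sketch of a step-level "legibility" penalty added to an outcome reward.
# The penalty only sees the emitted text, so optimizing against it pushes the
# policy to change what it writes, which need not change the underlying
# computation that produced it.

def legibility_score(step: str) -> float:
    """Crude stand-in for a learned legibility rater (hypothetical)."""
    tokens = step.split()
    if not tokens:
        return 0.0
    wordlike = sum(t.strip(".,;:!?").isalpha() for t in tokens)
    return wordlike / len(tokens)

def process_reward(cot: str, outcome_reward: float, penalty: float = 1.0) -> float:
    """Outcome reward minus a penalty for each illegible-looking step."""
    steps = [s.strip() for s in cot.splitlines() if s.strip()]
    illegibility = sum(1.0 - legibility_score(s) for s in steps)
    return outcome_reward - penalty * illegibility

# Two chains of thought that reach the same answer: one written out in plain
# words, one encoding the same content in opaque shorthand.
legible = "First, compute seventeen times three: fifty one.\nThen add four to get fifty five."
opaque = "17x3>51 ;; +4 -> 55 done"

print(process_reward(legible, outcome_reward=1.0))   # higher
print(process_reward(opaque, outcome_reward=1.0))    # lower
```

Both chains reach the same answer; the penalty only distinguishes their surface form, which is exactly the ambiguity between “stops thinking illegibly” and “stops writing illegibly”.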
> I also don’t think that we’re currently in a regime where there’s a large amount of pressure for very short CoTs
I agree that this may be true right now; the point is that you don’t need “special” incentives to get steganography.
> > seems likely that o1 was trained with supervision on the individual CoT steps
>
> OpenAI directly says that they didn’t do that
This is also how I interpreted them at first, but after talking to others about it, I noticed that they only say they “cannot train any policy compliance or user preferences onto the chain of thought”. Arguably, this doesn’t eliminate the possibility that they train a PRM using human raters’ opinions about whether the chain of thought is on the right track to solve the problem, independently of whether it’s following a policy. Or even if they don’t directly use human preferences to train the PRM, they could have some other automated reward signal on the level of individual steps.
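To spell out the distinction I mean, here is a purely hypothetical sketch (the function names and the dummy PRM are assumptions for illustration, not anything OpenAI has described) of outcome-only reward versus a step-level PRM signal:

```python
# Hypothetical sketch of "supervision on the individual CoT steps" versus
# outcome-only supervision. The PRM here is a dummy callable standing in for
# human raters or an automated step-level judge; treat every name and
# signature below as an assumption for illustration.

from typing import Callable, List

def outcome_reward(answer: str, correct_answer: str) -> float:
    """Outcome-only signal: one scalar for the whole trajectory."""
    return 1.0 if answer.strip() == correct_answer.strip() else 0.0

def process_reward(steps: List[str], prm: Callable[[str], float]) -> float:
    """Step-level signal: the PRM rates each step ("is this on the right
    track?"), and the ratings are aggregated into the training reward."""
    if not steps:
        return 0.0
    return sum(prm(step) for step in steps) / len(steps)

# Dummy PRM: dislikes unjustified leaps, says nothing about policy compliance.
def dummy_prm(step: str) -> float:
    return 0.0 if "guess" in step.lower() else 1.0

steps = ["Let x be the number of apples.", "Guess x = 7 and hope.", "So the answer is 7."]
print(outcome_reward("7", "7"))          # 1.0 regardless of how we got there
print(process_reward(steps, dummy_prm))  # penalizes the unjustified middle step
```

The second signal supervises individual steps without referencing any policy compliance or user preferences, which is the loophole I’m pointing at.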