There are multiple examples of o1 producing gibberish in its CoT summaries (I won’t insert them right now because linking stuff from mobile is painful; will edit this comment later).
The o1 CoT summaries are not CoTs per se, and I agree with Joel Burget’s comment that we should be wary of drawing conclusions from them. Based on the system card, it sounds like they are generated by a separate “summarizer model” distinct from o1 itself (e.g. “we trained the summarizer model away from producing disallowed content”).
The summaries do indeed seem pretty janky. I’ve experienced it firsthand; here’s a fun example I saw. But I don’t know of any particular reason to think this reflects anything about o1 CoTs themselves, rather than just the quirks of the (probably pretty weak) summarizer model.
OpenAI has only released a few examples of true o1 CoTs, but I haven’t noticed any gibberish in these. Indeed, their most distinguishing characteristic is how humanlike they sound relative to ordinary HHH assistant CoTs, with lots of stuff like “hmm” and “interesting” and “wait a minute.”
As to your larger point, IIUC you are proposing 2 different reasons why CoTs might become incomprehensible or misleading under RL-like training:
1. The distant relationship between the per-example reward and the individual sampled tokens might result in a lot of weird stuff getting reinforced (because it appeared in a sample that eventually got the right answer, whether or not it helped), which will cause the policy to produce such “weird stuff” more often, creating an incentive for the policy to make that stuff do useful work for it. (See the toy sketch after this list.)
2. If we reward shorter CoTs, then there is an incentive to be more efficient than human language if possible. (You say this in your final sentence, and IIUC this is also what you mean by “efficiency” in your first sentence?)
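For concreteness, here’s a toy, REINFORCE-flavored sketch of both pressures (all names are made up; this is not claimed to be OpenAI’s actual setup): every token in a sampled CoT shares a single scalar reward, so tokens that did no useful work get reinforced whenever the final answer happens to come out right, and a length penalty adds the incentive toward compression.

```python
# Toy sketch only: made-up names, not OpenAI's training code.
from dataclasses import dataclass


@dataclass
class Sample:
    cot_tokens: list[str]   # the sampled chain-of-thought
    answer_correct: bool    # did the final answer check out?


def outcome_reward(sample: Sample, length_penalty: float = 0.0) -> float:
    """Reward depends only on the outcome, optionally penalizing longer CoTs (pressure 2)."""
    r = 1.0 if sample.answer_correct else 0.0
    return r - length_penalty * len(sample.cot_tokens)


def per_token_credit(sample: Sample, length_penalty: float = 0.0) -> list[float]:
    """Pressure 1: every token gets the same credit, so an irrelevant token in a
    correct sample is reinforced exactly as much as the token that did the work."""
    r = outcome_reward(sample, length_penalty)
    return [r for _ in sample.cot_tokens]


if __name__ == "__main__":
    lucky = Sample(["some", "weird", "digression", "then", "a", "correct", "step"], True)
    print(per_token_credit(lucky))                      # the "weird" tokens are reinforced too
    print(per_token_credit(lucky, length_penalty=0.05)) # shorter CoTs now score higher
```

The toy’s only point is that credit assignment at this granularity can’t tell the “weird stuff” apart from the steps that actually did the work.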
Re 1: given OpenAI’s interest in process supervision, seems likely that o1 was trained with supervision on the individual CoT steps, not just on the final answer.
And not even in the usual RL sense of using a value function learned together with the policy, but instead using the process supervision approach, where you supervise the steps with a model trained to evaluate their correctness in isolation, so that it penalizes them for being invalid even if it so happens that the current policy would correct them later on (or make other mistakes that cancel them later).
This seems less conducive to the dynamic you describe than ordinary RL (eg PPO). Steps don’t get reinforced just because they belong to CoTs that landed on the right answer eventually; to get reinforced, a step has to be convincing to a separate model (a “PRM”) which was trained separately on human or LLM judgments of local validity, and which won’t know about any “private language” the policy is trying to use to talk to itself.
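As a contrast, here’s a minimal sketch of what that kind of step-level scoring could look like (the `prm_score` argument is a stand-in for a separately trained PRM, not a real API, and none of this is claimed to match o1’s actual training):

```python
# Hedged sketch of step-level (process) supervision; `prm_score` stands in for a
# separately trained process reward model and is not a real API.
from typing import Callable, List


def process_rewards(steps: List[str],
                    prm_score: Callable[[List[str], str], float]) -> List[float]:
    """Score each step in isolation, given only the steps before it.

    Unlike outcome-only RL, a locally invalid step scores low even if later steps
    would have corrected it, and a "private language" step the PRM can't follow
    scores low even if it secretly helps the policy reach the right answer.
    """
    rewards = []
    for i, step in enumerate(steps):
        context = steps[:i]   # no peeking at later steps or at the final answer
        rewards.append(prm_score(context, step))
    return rewards


if __name__ == "__main__":
    # Toy PRM: pretend legible arithmetic steps are convincing and opaque ones aren't.
    def toy_prm(context: List[str], step: str) -> float:
        return 0.9 if "=" in step and step.isascii() else 0.1

    cot = ["2 + 2 = 4", "qx7 glomp-cache ping", "4 * 3 = 12"]
    print(process_rewards(cot, toy_prm))  # the opaque middle step is penalized
```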
This approach helps with capabilities in practice, which is why it gets used (and why I assume it was used in o1), but it also helps with interpretability, both in theory (previous paragraph) and possibly in practice (see e.g. discussion of trace error vs final answer error here). This in turn is part of why I’m skeptical that weirdness in o1 summaries reflects weirdness in the underlying CoT.
Re 2: I agree this mechanism exists. However, the process supervision incentive just mentioned pushes against it; if the policy writes CoTs so compressed that others can’t understand them, a PRM won’t understand them either and will penalize them.
I also don’t think that we’re currently in a regime where there’s a large amount of pressure for very short CoTs: o1 is already very verbose, and it seems like OpenAI wants to make it write even more (cf. that one tweet about making it think for hours or days on end). Meanwhile, ordinary HHH assistant LLMs tend to be needlessly verbose to the point where it’s become an infamous annoyance, which suggests that “we need to make our models write fewer tokens” is not currently a huge priority for OpenAI and co. (And the user-facing per-token cost of inference has been in free fall for a while.)
> I don’t know of any particular reason to think this reflects anything about o1 CoTs themselves, rather than just the quirks of the (probably pretty weak) summarizer model.
My reason is that I have never heard of summarizers injecting totally irrelevant content. I have seen models misunderstand papers, but I have never seen a model write about anime in a summary of a physics paper.
> seems likely that o1 was trained with supervision on the individual CoT steps
OpenAI directly says that they didn’t do that:

> We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to “read the mind” of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.
On a separate note, doesn’t process supervision directly train an unfaithful CoT? There is no guarantee that training against illegible parts of the CoT trains away the thinking process that created those illegible parts, rather than just hiding it.
> I also don’t think that we’re currently in a regime where there’s a large amount of pressure for very short CoTs
I agree that this may be true right now; the point is that you don’t need “special” incentives to get steganography.
> seems likely that o1 was trained with supervision on the individual CoT steps
> OpenAI directly says that they didn’t do that
This is also how I interpreted them at first, but after talking to others about it, I noticed that they only say they “cannot train any policy compliance or user preferences onto the chain of thought”. Arguably, this doesn’t eliminate the possibility that they train a PRM using human raters’ opinions about whether the chain of thought is on the right track to solve the problem, independently of whether it’s following a policy. Or even if they don’t directly use human preferences to train the PRM, they could have some other automated reward signal on the level of individual steps.
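As one purely hypothetical example of such an automated step-level signal (not something OpenAI has described), each step’s explicit arithmetic claims could be checked mechanically:

```python
# Purely hypothetical illustration of an automated step-level reward signal:
# mechanically verify any "a <op> b = c" claim a CoT step makes.
import re

_EQ = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")


def step_checks_out(step: str) -> bool:
    """True if every arithmetic claim in the step actually holds (or it makes none)."""
    for a, op, b, c in _EQ.findall(step):
        a, b, c = int(a), int(b), int(c)
        if {"+": a + b, "-": a - b, "*": a * b}[op] != c:
            return False
    return True


if __name__ == "__main__":
    print(step_checks_out("First, 12 * 3 = 36."))       # True
    print(step_checks_out("Then 36 + 7 = 44, so ..."))  # False: 36 + 7 is 43
```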
Those are all good points. I think the bottom line is: if your training process includes pressures to make shorter CoTs or any other pressures to create steganography or “jargon”, it shouldn’t.
Unfortunately, shorter CoTs save compute at this point.
One big question here is: does the process supervision OpenAI used for o1 really work as you describe, with evaluations for every step? If the process is evaluated at a coarser granularity than individual steps, and there’s a loss for longer CoTs, then we have pressures that will obscure the true CoT if they’re applied strongly enough.
I’d say the bottom line is that anyone who wants to understand their agent, let alone control it, should include pressures that keep the CoT understandable, like the third-model supervision you describe (which I agree is probably how they did it).
Thinking about it this way, you could have the process evaluator penalize use of strange tokens or jargon.
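A rough sketch of that idea, assuming you already have a step-level validity score from the evaluator; the vocabulary, weight, and function names are illustrative, not anything OpenAI has described:

```python
# Rough sketch: combine an evaluator's validity score with a penalty for
# jargon-heavy steps. All names and weights are illustrative assumptions.


def legibility_penalty(step: str, known_vocab: set[str]) -> float:
    """Fraction of whitespace-separated tokens not found in an ordinary-language vocab."""
    tokens = step.split()
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t.lower() not in known_vocab)
    return unknown / len(tokens)


def shaped_step_reward(validity_score: float,   # from the process evaluator, in [0, 1]
                       step: str,
                       known_vocab: set[str],
                       penalty_weight: float = 0.5) -> float:
    """Penalize strange tokens or jargon on top of the evaluator's validity score."""
    return validity_score - penalty_weight * legibility_penalty(step, known_vocab)


if __name__ == "__main__":
    vocab = {"so", "the", "answer", "is", "because", "2", "+", "=", "4"}
    print(shaped_step_reward(0.9, "so the answer is 4 because 2 + 2 = 4", vocab))  # 0.9
    print(shaped_step_reward(0.9, "zq4 glomp 4 =: ans", vocab))                    # penalized
```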