I don’t think you need much incentive to develop steganography; the mere fact that human language is not a platonic ideal of efficiency for conveying thoughts is enough.
There are multiple examples of o1 producing gibberish in its CoT summary (I won’t insert them right now because linking things from mobile is painful; I’ll edit this comment later). Gibberish seems unsurprising given the description of the training process, which optimizes the model based only on the final answer. I imagine a process like this:
1. Sometimes the model produces gibberish in its CoT simply through lack of robustness.
2. That gibberish gets reinforced, because the reward never looks at the CoT itself.
3. The model learns to utilize it, just as it learns to utilize “...” tokens.
Continue this process long enough and you get a lot of <untranslatable_14637> in the CoT, even if the model never “tries to hide” its reasoning. I also suspect that “gibberish resulting from non-robustness” is in itself a more native format of thought for LLMs than imitating humans thinking out loud.
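The reinforcement loop above can be sketched as a toy outcome-only RL experiment. This is a minimal illustration, not a model of o1’s actual training: I assume a "model" that emits a single CoT token per episode, and (hypothetically) that a gibberish/filler token acts like extra compute, making the final answer correct slightly more often. Since reward depends only on the final answer, a REINFORCE-style update happily concentrates probability on the unreadable token.

```python
import math
import random

random.seed(0)

# Hypothetical setup: one CoT "token" per episode. The gibberish token
# gives extra "compute", so the final answer is right more often when
# it appears -- but reward only checks the answer, never CoT readability.
TOKENS = ["let's think", "step by step", "<untranslatable_14637>"]
GIBBERISH = 2
logits = [0.0, 0.0, 0.0]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sample(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

LR, BASELINE = 0.1, 0.6
for _ in range(5000):
    probs = softmax(logits)
    tok = sample(probs)
    # Outcome-only reward: correctness of the final answer, nothing else.
    p_correct = 0.8 if tok == GIBBERISH else 0.5
    reward = 1.0 if random.random() < p_correct else 0.0
    # REINFORCE update: push up whatever CoT token was sampled on
    # rewarded episodes, regardless of whether a human could read it.
    adv = reward - BASELINE
    for i in range(len(logits)):
        grad = (1.0 if i == tok else 0.0) - probs[i]
        logits[i] += LR * adv * grad

final = softmax(logits)
print(f"P(gibberish token) after training: {final[GIBBERISH]:.2f}")
```

The point of the sketch: nothing in the objective ever penalizes the unreadable token, so any chance advantage it confers gets locked in.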
The other factor driving steganography is economic: if you optimize the model to spend a minimal number of tokens, you get a model that “skips” or compresses some thoughts.
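That economic pressure can be made concrete with a toy length-penalized objective. All names and the penalty weight here are hypothetical: the only point is that among CoTs reaching the same correct answer, the objective strictly prefers the shorter, more compressed one.

```python
def score(cot_tokens, correct, alpha=0.05):
    """Toy objective: reward for a correct answer minus a per-token cost."""
    return (1.0 if correct else 0.0) - alpha * len(cot_tokens)

# Two chains of thought that reach the same (correct) answer.
verbose = ["first", "note", "that", "x", "equals", "two",
           "so", "the", "answer", "is", "four"]
compressed = ["x=2", "->4"]

print("verbose score:   ", score(verbose, True))
print("compressed score:", score(compressed, True))
```

Under any positive per-token cost the compressed chain wins, so optimization steadily trades human-legible steps for denser, skippier ones.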