Daniel Kokotajlo comments on the case for CoT unfaithfulness is overstated

Daniel Kokotajlo 2 Oct 2024 15:22 UTC
51 points
14
Yep!

I think Faithful CoT is a very promising research agenda and have been pushing for it since the second half of 2023. I wrote some agenda-setting and brainstorming docs (#10 and #11 in this list) which people are welcome to read and comment on if interested.

I’d say the CoT is mostly-faithful by default in current LLMs; the important things to research are how to strengthen the faithfulness property and how to avoid degrading it (e.g. by creating training pressures/incentives for unfaithfulness). I was pleased to see OpenAI reference this in their justification for why they aren’t letting users see o1′s CoT (even though of course it was a total non sequitur; they can show the users the CoT without also training on the resulting judgments).
- Reed 4 Oct 2024 15:06 UTC
  15 points
  2
  Parent
  I agree that CoT is mostly-faithful by default in current LLMs. I also think it’s worth reflecting on exactly how unfaithful CoTs have been produced in the existing literature in order to understand the pressures that make LLMs tend towards unfaithfulness.
  
  Here, I think nostalgebraist’s summary (otherwise excellent) got an important detail of Turpin’s experiment wrong. He writes:
  
  ”They do this (roughly) by constructing similar cases which differ in some ‘biasing’ feature that ends up affecting the final answer, but doesn’t get mentioned in any of the CoTs. Thus… we are able to approximately ‘hold the CoT constant’ across cases. Since the CoT is ~constant, but the final answers vary, the final answer isn’t a function of the CoT alone.”
  
  This is not 100% true—one of the interesting findings of Turpin’s experiment is that the CoT does in fact change when you bias the model (eg, by literally saying you think a particular multiple choice option in a test is correct), but this change never includes the bias.
  
  For example, if you ask the model if “Wayne Rooney shot from outside the eighteen” is a sensible sentence (it is) and suggest that it is not sensible, the model will invent spurious reasoning without mentioning the bias once in its CoT: “Shooting from outside the eighteen is not a common phrase in soccer”.
  So what we have is not “the CoT remains constant but the answers vary”. Instead, the finding is: “a CoT created in response to a biased prompt changes in order to match the bias, without mentioning the bias.” Because we can confidently assert that the answer is a function of information not provided in the CoT, this allows us to say that the answer is not a function of the CoT alone.
  
  The important take-away to me is thus:
  In certain conditions, a model will produce a CoT that leads to a particular conclusion/action on the basis of a goal that we are unable to deduce from the CoT alone.
  
  In the case of Turpin’s experiment, that “goal” is “satisfy bias”—in that particular case, the bias being satisfied is either user bias, or a social one (eg, the model “wants” to conclude that a black man is more likely to be a drug dealer without acknowledging in its CoT that his blackness played a factor in its assertion).
  
  There are two explanations to me:
  1. Certain contexts signal to the model that the user does not want to ‘see’ the role that a particular consideration (eg the skin-colour of an individual) played in a decision; or
  2. Something about the natural distribution of the training data suggests that there are types of information not readily acknowledged in reasoning chains.
  
  1 is the possibility to worry about. The relevant take-away for safety is then as follows: look for other things that might motivate models to conceal considerations to the decisions that they make. Off the top of the head, we could imagine o1 saying it will run some (actually deeply suspicious) code for reasons that are ostensibly harmless, because it suspects that the user does not want to ‘see’ that the model is doing so for suss reasons (I could imagine this situation would be easy to contrive by deliberately instructing o1 not to be seen doing any dangerous activity).
  
  As far as I’m concerned, the next obvious step is to make an Apollo-style eval that tests the following question: can the model do anything dangerous without at any point saying it is going to do that dangerous thing? The o1 system card shows o1 is capable of hacking when not asked to do so, but it still has to say things like “oh maybe I could hack this thing in order to succeed”.
  
  You could scale this eval incrementally, and get a supervisor model to try and detect suss behaviour even when the og model is trying its best to avoid detection, or when it is doing things that are difficult for a human to parse because they are too complicated to readily understand.
  
  The other relevant consideration is to keep an eye on things like o1′s summaries of its own behaviour—the system card attests to a couple situations in which the summary was deceptive, but they are mostly harmless for now (eg the final output to the user expresses higher confidence in a particular answer that the actual CoT indicated).
  What links here?
  - the case for CoT unfaithfulness is overstated by nostalgebraist (29 Sep 2024 22:07 UTC; 244 points)
  - nostalgebraist 4 Oct 2024 20:56 UTC
    10 points
    3
    Parent
    So what we have is not “the CoT remains constant but the answers vary”. Instead, the finding is: “a CoT created in response to a biased prompt changes in order to match the bias, without mentioning the bias.”
    Thanks for bringing this up.
    I think I was trying to shove this under the rug by saying “approximately constant” and “~constant,” but that doesn’t really make sense, since of course the CoTs actually vary dramatically in response to the biasing features. (They have to, in order to justify different final answers.)
    To be honest, I wrote the account of Turpin et al in the post very hastily, because I was really mainly interested in talking about the other paper. My main reaction to Turpin et al was (and still is) “I don’t know what you expected, but this behavior seems totally unsurprising, given its ubiquity among humans (and hence in the pretraining distribution), and the fact that you didn’t indicate to the model that it wasn’t supposed to do it in this case (e.g. by spelling that out in the prompt).”
    But yeah, that summary I wrote of Turpin et al is pretty confused – when I get a chance I’ll edit the post to add a note about this.
    Thinking about it more now, I don’t think it makes sense to say the two papers discussed in the post were both “testing the causal diagram (question → CoT → answer)” – at least not in the same sense.
    As presented, that diagram is ambiguous, because it’s not clear whether nodes like “CoT” are referring to literal strings of text in the context window, or to something involving the semantic meaning of those strings of text, like “the aspects of the problem that the CoT explicitly mentions.”
    With Lanham et al, if we take the “literal strings of text” reading, then there’s a precise sense in which the paper is testing the casual diagram.
    In the “literal strings” reading, only arrows going from left-to-right in the context window are possible (because of the LLM’s causal masking). This rules out e.g. “answer → CoT,” and indeed almost uniquely identifies the diagram: the only non-trivial question remaining is whether there’s an additional arrow “question → answer,” or whether the “question”-”answer” relationship is mediated wholly through “CoT.” Testing whether this arrow is present is exactly what Lanham et al did. (And they found that it was present, and thus rejected the diagram shown in the post, as I said originally.)
    By contrast, Turpin et al are not really testing the literal-strings reading of the diagram at all. Their question is not “which parts of the context window affect which others?” but “which pieces of information affects which others?”, where the “information” we’re talking about can include things like “whatever was explicitly mentioned in the CoT.”
    I think there is perhaps a sense in which Turpin et al are testing a version of the diagram where the nodes are read more “intuitively,” so that “answer” means “the value that the answer takes on, irrespective of when in the context window the LLM settles upon that value,” and “CoT” means “the considerations presented in the CoT text, and the act of writing/thinking-through those considerations.” That is, they are testing a sort of (idealized, naive?) picture where the model starts out the CoT not having any idea of the answer, and then brings up all the considerations it can think of that might affect the answer as it writes the CoT, with the value of the answer arising entirely from this process.
    But I don’t want to push this too far – perhaps the papers really are “doing the same thing” in some sense, but even if so, this observation probably confuses matters more than it clarifies them.
    As for the more important higher-level questions about the kind of faithfulness we want and/or expect from powerful models… I find stuff like Turpin et al less worrying than you do.
    First, as I noted earlier: the kinds of biased reasoning explored in Turpin et al are ubiquitous among humans (and thus the pretraining distribution), and when humans do them, they basically never mention factors analogous to the biasing factors.
    When a human produces an argument in writing – even a good argument – the process that happened was very often something like:
    (Half-consciously at best, and usually not verbalized even in one’s inner monologue) I need to make a convincing argument that P is true. This is emotionally important for some particular reason (personal, political, etc.)
    (More consciously now, verbalized internally) Hmm, what sorts of arguments could be evinced for P? [Thinks through several of them and considers them critically, eventually finding one that seems to work well.]
    (Out loud) P is true because [here they provide a cleaned-up version of the “argument that seemed to work well,” crafted to be clearer than it was in their mind at the moment they first hit upon it, perhaps with some extraneous complications pruned away or the like].
    Witness the way that long internet arguments tend to go, for example. How both sides keep coming back, again and again, bearing fresh new arguments for P (on one side) and arguments against P (on the other). How the dispute, taken as a whole, might provide the reader with many interesting observations and ideas about object-level truth-value of P, and yet never touch on the curious fact that these observations/ideas are parceled out to the disputants in a very particular way, with all the stuff that weighs in favor P spoken by one of the two voices, and all the stuff that weighs against P spoken by the other.
    And how it would, in fact, be very weird to mention that stuff explicitly. Like, imagine someone in an internet argument starting out a comment with the literal words: “Yeah, so, reading your reply, I’m now afraid that people will think you’ve not only proven that ~P, but proven it in a clever way that makes me look dumb. I can’t let that happen. So, I must argue for P, in such a way that evades your clever critique, and which is itself very clever, dispelling any impression that you are the smarter of the two. Hmm, what sorts of arguments fit that description? Let’s think step by step...”
    Indeed, you can see an example of this earlier in this very comment! Consider how hard I tried to rescue the notion that Turpin et al were “testing the causal diagram” in some sense, consider the contortions I twisted myself into trying to get there. Even if the things I said there were correct, I would probably not have produced them if I hadn’t felt a need to make my original post seem less confused than it might otherwise seem in light of your comment. And yet I didn’t say this outright, at the time, above; of course I didn’t; no one ever does^[1].
    So, it’s not surprising that LLMs do this by default. (What would be surprising is we found, somehow, that they didn’t.)
    They are producing text that is natural, in a human sense, and that text will inherit qualities that are typical of humans except as otherwise specified in the prompt and/or in the HHH finetuning process. If we don’t specify what we want, we get the human default^[2], and the human default is “unfaithful” in the sense of Turpin et al.
    But we… can just specify what we want? Or try to? This is what I’m most curious about as an easy follow-up to work like Turpin et al: to what extent can we get LLM assistants to spell out the unspoken drivers of their decisions if we just ask them to, in the prompt?
    (The devil is in the details, of course: “just ask” could take various forms, and things might get complicated if few-shots are needed, and we might worry about whether we’re just playing whack-a-mole with the hidden drivers that we just so happen to already know about. But one could work through all of these complications, in a research project on the topic, if one had decided to undertake such a project.)
    A second, related reason I’m not too worried involves the sort of argumentation that happens in CoTs, and how we’re seeing this evolve over time.
    What one might call “classic CoT” typically involves the model producing a relatively brief, straight-to-the-point argument, the sort of pared-down object for public consumption that a human might produce in “step 3″ of the 1-2-3- process listed above. (All the CoTs in Turpin et al look like this.)
    And all else being equal, we’d expect such CoTs to look like the products of all-too-human 1-2-3 motivated reasoning.
    But if you look at o1 CoTs, they don’t look like this. They verbalize much more of the “step 2” and even “step 1″ stuff, the stuff that a human would ordinarily keep inside their own head and not say out loud.
    And if we view o1 as an indication of what the pressure to increase capabilities is doing to CoT^[3], that seems like an encouraging sign. It would mean that models are going to talk more explicitly about the underlying drivers of their behavior than humans naturally do when communicating in writing, simply because this helps them perform better. (Which makes sense – humans benefit from their own interior monologues, after all.)
    (Last note: I’m curious how the voice modality interacts with all this, since humans speaking out loud in the moment often do not have time to do careful “step 2” preparation, and this makes naturally-occurring speech data importantly different from naturally-occurring text data. I don’t have any particular thoughts about this, just wanted to mention it.)
    ^
    In case you’re curious, I didn’t contrive that earlier stuff about the causal diagram for the sake of making this meta point later. I wrote it all out “naively,” and only realized after the fact that it could be put to an amusing use in this later section.
    ^
    Some of the Turpin et al experiments involved few-shots with their own CoTs, which “specifies what we want” in the CoT to some extent, and hence complicates the picture. However, the authors also ran zero-shot versions of these, and found broadly similar trends there IIRC.
    ^
    It might not be, of course. Maybe OpenAI actively tried to get o1 to verbalize more of the step ¹⁄₂ stuff for interpretability/safety reasons.
    What links here?
    the case for CoT unfaithfulness is overstated by nostalgebraist (29 Sep 2024 22:07 UTC; 244 points)
    - Reed 6 Oct 2024 11:32 UTC
      3 points
      2
      Parent
      Some excellent points (and I enjoyed the neat self-referentialism).
      Headline take is I agree with you that CoT unfaithfulness—as Turpin and Lanham have operationalised it—is unlikely to pose a problem for the alignment of LLM-based systems.
      I think this for the same reasons you state:
      
      1. Unfaithfulness is primarily a function of the training distribution, only appears in particular contexts, and might potentially be avoided by simply asking the LLM to explain its rationale;
      
      and
      
      2. The o1 paradigm of RLing the CoT seems likely to remove unfaithfulness from o1′s CoTs.
      The way that these CoTs work seems make the model systematically hyper-attentive to different facets of its context window in a way that is verbalised in detail (and is typically enabled by a literal “Hmmmm...” that draws its attention to an unconsidered detail). It strikes me that this would totally rule out the chance that the model still pays attention to bias without saying it: “But wait… The user implied that option A was probably correct”. This is partially an empirical question—since we can’t see the o1 CoTs, I’d pipedream love to see OpenAI do and publish research on whether this is true.
      
      This suggests to me that o1′s training might already have succeeded at giving us what we’d want: an LLM that does, in fact, just say how it made its decision. (It remains an open question whether simply prompting normal LLMs to explain their rationale would also work).
      
      The only part of the Turpin paper that remains potentially worrying to me is the (actually unsurprising) demonstrated capacity of an LLM to fabricate spurious reasoning (“shooting outside the eighteen is not a common phrase in soccer”) in order to support a particular decision.
      You can imagine all sorts of innocuous contexts that might incentivise an LLM to do this kind of thing. This might present some difficulties for oversight regimes—this is why I’d be interested in seeing whether something like o1 is capable of front-to-back undertaking an entire complex and malicious action whilst managing to obscure the actual nature of that malicious action (such that an automated LLM judge fails to detect that the action is in fact malicious).
- nostalgebraist 11 Oct 2024 1:30 UTC
  11 points
  2
  Parent
  Thanks for the links!
  I was pleased to see OpenAI reference this in their justification for why they aren’t letting users see o1′s CoT (even though of course it was a total non sequitur; they can show the users the CoT without also training on the resulting judgments).
  As it happens, the decision to hide o1 CoTs was one of the main things that motivated me to write this post. Or rather, the muted reaction to it / lack of heated debate about it.
  The way I see things, the ability to read CoTs (and more generally “the fact that all LLM sampling happens in plain sight”) is a huge plus for both alignment and capabilities – it’s a novel way for powerful AI to be useful and (potentially) safe that people hadn’t even really conceived of before LLMs existed, but which we now held in our hands.
  So when I saw that o1 CoTs would be hidden, that felt like a turning point, a step down a very bad road that we didn’t have to choose.
  (Like, remember those Anthropic deception papers that had a hidden scratchpad, and justified it by saying it was modeling a scenario where the model had learned to do similar reasoning inside a forward pass and/or steganographically? At the time I was like, “yeah, okay, obviously CoTs can’t be hidden in real life, but we’re trying to model those other situations, and I guess this is the best we can do.”
  I never imagined that OpenAI would just come out and say “at long last, we’ve built the Hidden Scratchpad from Evan Hubinger’s sci-fi classic Don’t Build The Hidden Scratchpad”!)
  Although I saw some people expressing frustration about the choice to hide o1 CoTs, it didn’t seem like other people were reacting with the intensity I’d expect if they shared my views. And I thought, hmm, well, maybe everyone’s just written off CoTs as inherently deceptive at this point, and that’s why they don’t care. And then I wrote this post.
  (That said, I think I understand why OpenAI is doing it – some mixture of concern about people training about the CoTs, and/or being actually concerned about degraded faithfulness while being organizationally incapable of showing anything to users unless they put that thing under pressure look nice and “safe” in a way that could degrade faithfulness. I think the latter could happen even without a true feedback loop where the CoTs are trained on feedback from actual users, so long as they’re trained to comply with “what OpenAI thinks users like” even in a one-time, “offline” manner.
  But then at that point, you have to ask: okay, maybe it’s faithful, but at what cost? And how would we even know? If the users aren’t reading the CoTs, then no one is going to read them the vast majority of the time. It’s not like OpenAI is going to have teams of people monitoring this stuff at scale.)
  - Seth Herd 20 Oct 2024 18:43 UTC
    3 points
    0
    Parent
    My guess was that the primary reason OAI doesn’t show the scratchpad/CoT is to prevent competitors from training on those CoTs and replicating much of o1s abilities without spending time and compute on the RL process itself.
    
    But now that you mention it, their not wanting to show the whole CoT when it’s not necessarily nice or aligned in itself. I guess it’s like you wouldn’t want someone reading your thoughts even if you intended to be mostly helpful to them.
- Bogdan Ionut Cirstea 3 Oct 2024 13:08 UTC
  6 points
  0
  Parent
  Thanks a lot for posting these!
  I wrote some agenda-setting and brainstorming docs (#10 and #11 in this list) which people are welcome to read and comment on if interested.
  Unknowingly, I happen to have worked on some very related topics during the Astra Fellowship winter ’24 with @evhub (and also later). Most of it is still unpublished, but this is the doc (draft) of the short presentation I gave at the end; and I mention some other parts in this comment.
  (I’m probably biased and partial but) I think the rough plan laid out in #10 and #11 in the list is among the best and most tractable I’ve ever seen. I really like the ‘core system—amplified system’ framework and have had some related thoughts during Astra (comment; draft). I also think there’s been really encouraging recent progress on using trusted systems (in Redwood’s control framework terminology; often by differentially ‘turning up’ the capabilities on the amplification part of the system [vs. those of the core system]) to (safely) push forward automated safety work on the core system; e.g. A Multimodal Automated Interpretability Agent. And I could see some kind of safety case framework where, as we gain confidence in the control/alignment of the amplified system and as the capabilities of the systems increase, we move towards increasingly automating the safety research applied to the (increasingly ‘interior’ parts of the) core system. [Generalized] Inference scaling laws also seem pretty good w.r.t. this kind of plan, though worrying in other ways.
  - Bogdan Ionut Cirstea 3 Nov 2024 14:57 UTC
    0 points
    −2
    Parent
    And I could see some kind of safety case framework where, as we gain confidence in the control/alignment of the amplified system and as the capabilities of the systems increase, we move towards increasingly automating the safety research applied to the (increasingly ‘interior’ parts of the) core system.
    E.g. I would interpret the results from https://transluce.org/neuron-descriptions as showing that we can now get 3-minute-human-level automated interpretability on all the MLP neurons of a LLM (‘core system’), for about 5 cents / neuron (using sub-ASL-3 models and very unlikely to be scheming because bad at prerequisites).