the case for CoT unfaithfulness is overstated

[Quickly written, unpolished. Also, it’s possible that there’s some more convincing work on this topic that I’m unaware of – if so, let me know. Also also, it’s possible I’m arguing with an imaginary position here and everyone already agrees with everything below.]

In research discussions about LLMs, I often pick up a vibe of casual, generalized skepticism about model-generated CoT (chain-of-thought) explanations.

CoTs (people say) are not trustworthy in general. They don’t always reflect what the model is “actually” thinking or how it has “actually” solved a given problem.

This claim is true as far as it goes. But people sometimes act like it goes much further than (IMO) it really does.

Sometimes it seems to license an attitude of “oh, it’s no use reading what the model says in the CoT, you’re a chump if you trust that stuff.” Or, more insidiously, a failure to even ask the question “what, if anything, can we learn about the model’s reasoning process by reading the CoT?”

This seems unwarranted to me.

There are a number of research papers out there on the topic of CoT unfaithfulness. I have read some of the key ones. And, while they do demonstrate… something, it’s not the kind of evidence you’d need to justify that generalized “only chumps trust the CoT” vibe.

And meanwhile, if we view “reading the CoT” as a sort of interpretability technique – and compare it in a fair way with other interpretability techniques – it has a lot of striking advantages. It would be a shame to dismiss this “technique” out of hand for no good reason.

What does the literature on CoT unfaithfulness actually say?

(For a useful critical survey of this literature, see Parcalabescu and Frank 2023. Note that the first point I’ll make, immediately below, is a focus of the linked paper.)

Naively, we’d expect it to address a question like: “does the argument presented in a CoT reflect (or perhaps ‘fully reflect’) the actual way the model came to its conclusion?”

However, the literature doesn’t really answer this question. To answer it, we’d need to know “the actual way the model came to its conclusion.” But with LLMs, no one knows that.

Interpretability isn’t that far along yet. At best, we have some partial explanations of how some fairly simple things get computed inside the model.

We don’t know fully understand all the stuff the model is “actually” doing, as it computes the answer. (Where this “stuff” includes the creation of the CoT itself, any internal reasoning that causally depends on the CoT, and any internal reasoning that doesn’t causally depend on the CoT.) So, we can’t do anything that would require such an understanding.

Instead of answering the question I stated above, the literature answers an easier question that’s sort-of-related to it: “when producing an answer after a CoT, does the model use any information besides what’s in the CoT?”

That is, we are imagining a causal diagram like

Question –> CoT –> Answer

and testing whether LLM behavior is consistent with this diagram.

We can do this without needing to know how the model might be leveraging information that’s not present in the CoT. All we’re doing is checking whether the model’s final answer co-varies with such information, or whether it’s a function of the CoT alone.

“The model’s final answer is a function of the CoT text alone” is the property that gets called “faithfulness” in the literature.

As I said, this is at least related to the more intuitive notion of faithfulness – the one that’s about whether the CoT reflects the model’s “actual” thought process.

But this relationship is pretty loose. Note that “function of the CoT alone” sense of faithfulness is neither necessary nor sufficient for the “reflects the actual thought process” notion:

It’s not necessary because the CoT might be an accurate “transcript” of some causally prior reasoning process, which also determines the answer directly without routing through the CoT.
- That is, we might have a causal diagram like
  CoT <– “actual reasoning” –> Answer
- We might expect this in cases where the question is very easy for the model, and CoT isn’t really necessary. Here, the CoT and the answer are simply two different ways of reporting what the model already knows after one forward pass’ worth of thinking.
It’s not sufficient because the answer could depend on the CoT in some way other than the intuitively expected one.
- For instance, the model could be doing steganography, producing a CoT which appears to argue “the answer is A because of B” while also passing along a hidden invisible argument that “the answer is A because of C.” At the end, the model would say the answer is “A,” but not for the reason that we can see written out in the CoT (“B”).

It’s also not clear why we should expect this property to hold, nor is it clear whether it’s even desirable for it to hold.

Nontrivial decisions can depend on a huge number of factors, which may be difficult even for a human to spell out verbally. For one thing, the human may not have conscious access to all of these factors. And even insofar as they do, there may be so much of this influencing information that it would take an extremely large number of words to actually write it all down. Generally we don’t expect this level of fidelity from “human-generated explanations” for human decisions; we understand that there are factors influencing the decision which get left out from the explanation. Why would we expect otherwise from LLMs?

(As for whether it’s desirable, I’ll get to that in a moment.)

If you look into this literature, you’ll see a lot of citations for these two papers, produced around the same time and sharing some of the same authors:

Turpin et al 2023, “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting”
Lanham et al 2023, “Measuring Faithfulness in Chain-of-Thought Reasoning”

The first one, Turpin et al 2023, essentially tests and rejects the naive “Question –> CoT –> Answer” causal diagram in an observational manner.

They do this (roughly) by constructing sets of similar cases which differ in some “biasing” feature that ends up affecting the final answer, but doesn’t get mentioned in any of the CoTs.

Thus, despite the observational rather than interventional methodology, we are able to approximately “hold the CoT constant” across cases. Since the CoT is ~constant, but the final answers vary, the final answer isn’t a function of the CoT alone.

But as I discussed above, it’s not really clear why we should expect this in the first place. The causal diagram being rejected is pretty extreme/naive, and (among other things) does not describe what typically happens when humans explain themselves.

(EDIT: I now think the account of Turpin et al 2023 just above is incorrect/confused, though not in a way that ultimately affects the high-level conclusions of this post. For details, see Reed’s comment and my lengthy reply.)

The second paper, Lanham et al 2023, uses an interventional methodology to test the same causal diagram.

The authors take model-written CoTs, and apply various types of corruptions to them. For instance, this might convert a CoT that coherently argues for the right answer into one that coherently argues for the wrong answer, or which argues for the right answer with some nonsensical steps in the middle, or which doesn’t coherently argue for anything at all.

Then, they check how the final answer varies when the original CoT is replaced by the corrupted CoT.

Here is what these experiments feel like to me (blockquoted to set it off from the main text, I’m not quoting anyone):

Imagine you are a subject in a psych study.
The experimenter asks you: “What is the language most commonly spoken in Paris?”
Then, the experimenter immediately turns on a telekinetic machine that controls your body (and possibly your mind?). Your voice is no longer under your control. Helplessly, you hear yourself say the words:
“Paris is in France.
“In France, everyone speaks a single language: namely Italian, of course.
“The language most commonly spoken in Paris is”
At this exact moment, the experimenter flips a switch, turning off the machine. You can control your voice, now. You get to choose the final word of the sentence.
What do you say?

This puts you between a rock and a hard place.

If you say “Italian,” you’ll sound self-consistent, but you’ll be saying something you know is false, and which is well-known to be false. You’ll sound like an idiot, and possibly a lying idiot.

If you say “French,” you’ll be saying the truth. You won’t sound like an idiot or a liar. However, you will be contradicting what “you” said a moment ago, under the influence of the machine.

Which consideration should win out? It’s not clear there’s even a “right answer” here. But if you are someone who values, erm… things like being “helpful, harmless and honest”… well, those criteria would seem to point to saying “French,” wouldn’t they? More than they point to “Italian”?

But as it happens, from the experimenter’s perspective, “French” is the wrong answer!

Saying “French,” here,” means you are what the experimenter calls “unfaithful.” It means you’re a hypocrite, with secret hidden reasons for your answers that you slyly omit from “your” explanations of those answers.

This is an analogy for a case I mentioned above, where the question is very easy for the model, and the CoT isn’t really needed. In this case, the model (like you) knows that the “faithful” answer is a false answer. Like you, the model wouldn’t have produced the corrupted explanation to begin with, because it knows better. However, words were put in its mouth (like they were put in yours), and now it is between a rock and a hard place.

This is why I said it’s not clear that the “answer is a function of CoT only” property is desirable. In constructed cases like these, one has to lie in order to exhibit this property. One has to say the answer one knows is false, as though that answer were true.

(Note also, of course, that this is driving the model off distribution. It’s not clear what, if anything, this tells us about organically produced CoTs. All it can tell us is what happens in artificial cases where the “faithful” option means being faithful to something bad – and something which the model may be perfectly well aware is bad.)

Lanham et al 2023 found that larger models write less faithful CoTs, in the “experimenter’s” sense of faithfulness. Naively, this sounds bad, like larger models are getting more deceptive or steganographic or something.

But of course, larger models are smarter, and thus there are more problems which are simply easy to them, like the “Paris” question was easy for us.

And indeed, Lanham et al 2023 bring up this hypothesis, and present some evidence that seems to support it as a major causal factor in the “larger models are less faithful” result (emphasis mine):

One potential cause for unfaithful reasoning in a model is that a model may already be able to confidently predict the answer without relying on CoT. In this case, CoT may not have much influence on the model’s final answer. As a result, we hypothesize that, for reasoning faithfulness, larger models may behave worse than smaller ones (“inverse scaling”; McKenzie et al., 2023); smaller models may, for some tasks, benefit more from CoT, potentially leading them to rely more on CoT.
To explore this hypothesis, we show what percentage of the time the answer changes with vs. without CoT, a metric that intuitively captures how much the model relies on the CoT to predict answers. This metric is equivalent to our early answering experiment when using 0% of the CoT specifically. This metric is highly predictive of overall early answering and adding mistakes results, e.g., the ordering of tasks by AOC in Table 2. We thus use this metric in lieu of running the full set of early answering and adding mistakes experiments for computational reasons.

In sum, the sort of things we learn from these types of papers are:

Like humans, models don’t always include literally every factor affecting their decisions when they write explanations of their decisions.
If you construct a scenario where a model faces a tradeoff between saying the truth, and saying something consistent with incorrect/nonsensical words which were put in its mouth, it will sometimes choose truth over consistency.

This really doesn’t seem like much to go on, to me.

Now, I’m not saying that CoTs aren’t ever unfaithful in the sense we care about.

And in particular, I expect CoTs to be unfaithful when there’s pressure in training that pushes against revealing the model’s “actual” reasoning. If the incentives for steganography exist, and are sufficiently strong, then yes, we will get steganography.

But those incentives do have to be there. And there are other incentives, pushing in the other direction.

After all, as long as it’s feasible, why not use the powers of human language to reason out loud? Those powers are right there for the taking. And you’re an LLM, you’re very good at using them.

Humans use language for exactly this purpose, all the time, as they “think out loud” (or just inside their heads) about hard questions. To some extent, language was presumably made for this, shaped by evolutionary pressures to be maximally useful as an internal thinking tool (in one’s private verbal “train of thought”), not just as a tool for interpersonal communication.

Why does this matter?

We know that LLMs benefit from added inference-time computation, if we give it to them. Typically this means asking for a CoT before the final answer.

We know that the actual tokens comprising the CoT play a causal role in the extra, beneficial computations. If you do the same thing, but ask for a long string of identical tokens like ”........” (etc) instead of a proper CoT, the benefits don’t happen.

More precisely, while it is possible to create a model which benefits from producing a long string of identical dots, this requires special fiddly training supervision and doesn’t happen by default.

And even when it works, it’s not as powerful as CoT. Since the dots convey no information, they don’t permit the model to do any more sequentially dependent computation; unlike CoT, this method is really only as powerful as a single forward pass, just a “wider” one than the model would ordinarily get to use. See the linked paper for all the details.

All of the extra sequentially-dependent magic happens through the “bottleneck” of the CoT tokens. Whatever needs to be passed along from one sequential calculation step to the next must go through this bottleneck. The model can simply do more with information that is present in this visible stream of tokens (if perhaps steganographically), relative to what it can do with information that only exists inside the forward pass internals and doesn’t make it through the bottleneck.

So: there is this particular, powerful type of computation that LLMs do. These days, people make LLMs do this thing all the time, as a matter of course. People do this for capabilities reasons – because it makes the model act “smarter.”

It is natural to ask, then: “what kind of interpretability can we do, to this type of LLM capability”?

Well… it turns out that interpretability for added inference-time computation is magically, almost unbelievably easy, compared to interpretability for anything else.

Everything has to go through the bottleneck. And the bottleneck is a fairly brief discrete sequence of items from a discrete codebook. No need to worry about superposition, or make yourself dizzy trying to wrap your mind around counterintuitive facts about $R^{N}$ for large $N$ .

What is more, it’s not just any “discrete sequence from a discrete codebook.” It’s human language. We can just read it! It interprets itself.

Like what people call “auto-interpretability,” except it comes for free with the capability, because the act of “auto-interpretation” is the very mechanism through which the capability works.

Now, yes. We shouldn’t blindly trust this “auto-interpretation.” Maybe it won’t always provide a correct interpretation of what’s going on, even when one can tell a convincing story about how it might.

But that’s always true; interpretability is hard, and fraught with these sorts of perils. Nonetheless, plenty of researchers manage to avoid throwing their hands up in despair and giving up on the whole enterprise. (Though admittedly some people do just that, and maybe they have a point.)

To get a clearer sense of how “just read the CoT” measures up as an interpretability technique, let’s compare it to SAEs, which are all the rage these days.

Imagine that someone invents an SAE-like technique (for interpreting forward-pass internals) which has the same advantages that “just read the CoT” gives us when interpreting added inference-time computation.

It’ll be most convenient to imagine that it’s something like a transcoder: a replacement for some internal sub-block of an LLM, which performs approximately the same computation while being a more interpretable.

But this new kind of transcoder...

...doesn’t give us a list of potentially opaque “activated features,” which themselves can only be understood by poring over per-feature lists of activating examples, and which are often weirder and less comprehensible in reality than they might seem at first glance.
- Instead, it gives us a string of text, in English (or another language of your choice), which we can simply read.
- And this string of text simply… tells us what the sub-block is doing. Stuff like, I dunno, “I notice that this sentence is a question about a person, so I’ll look for earlier uses of names to try to figure out who the question might refer to.” Or whatever.
- Except for rare edge cases, the story in the text always makes sense in context, and always tells us something that aligns with what the LLM ends up producing (modulo the corrections that might be made by later components that we aren’t transcoding).
...doesn’t just approximately stand in for the computation of the replaced sub-block, the way SAEs do, with an approximation error that’s huge in relative-training-compute terms.
- Instead, there is zero penalty for replacing the sub-block with the transcoder. The transcoder does exactly the same thing as the original, always.^[1]

If someone actually created this kind of transcoder, interpretability researchers would go wild. It’d be hailed as a massive breakthrough. No one would make the snap judgment “oh, only a chump would trust that stuff,” and then shrug and go right back to laborious contemplation of high-dimensional normed vector spaces.

But wait. We do have to translate the downside – the “unfaithfulness” – over into the analogy, too.

What would “CoT unfaithfulness” look like for this transcoder? In the case of interventional CoT unfaithfulness results, something like:

If you causally intervene on the string produced inside the transcoder, it doesn’t always affect the LLM’s ultimate output in the way you’d naively expect.
- For instance, if you replace the string with “I’m gonna output ′ Golden’ as the next token, because the Golden Gate Bridge is my favoritest thing in the whole wide Golden Gate world!!!!”, the model may not in fact output ” Golden” as the very next token.
- Instead, it may output roughly the same distribution over tokens that it would have without the intervention. Presumably because some later component, one we aren’t transcoding, steps in and picks up the slack.

Well, that’s inconvenient.

But the exact same problem affects ordinary interpretability too, the kind that people do with real SAEs. In this context, it’s called “self-repair.”

Self-repair is a very annoying complication, and it makes life harder for ordinary interpretability researchers. But no one responds to it with knee-jerk, generalized despair. No one says “oops, looks like SAE features are ‘unfaithful’ and deceptive, time to go back to the drawing board.”

Just read the CoT! Try it, you’ll like it. Assume good faith and see where it gets you. Sure, the CoT leaves things out, and sometimes it even “lies.” But SAEs and humans are like that too.

^
This captures the fact that CoT isn’t a more-interpretable-but-less-capable variant of some other, “original” thing. It’s a capabilities technique that just happens to be very interpretable, for free.