instruction tuning and autoregressive distribution shift

[Note: this began life as a “Quick Takes” comment, but it got pretty long, so I figured I might as well convert it to a regular post.]

In LM training, every token provides new information about “the world beyond the LM” that can be used/“learned” in-context to better predict future tokens in the same window.

But when text is produced by autoregressive sampling from the same LM, it is not informative in the same way, at least not to the same extent[1]. Thus, sampling inevitably produces a distribution shift.
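
To make that contrast concrete, here’s a minimal toy sketch (mine, not anything from a real training stack) of the two regimes: under teacher forcing, every prefix the model conditions on was written by someone else, whereas under autoregressive sampling the model conditions on its own previous outputs. The `next_token_dist` stub below is a hypothetical stand-in for a real LM’s predictive distribution.

```python
import math
import random

# Toy vocabulary and a hypothetical stand-in for a real LM's predictive distribution.
VOCAB = ["the", "gold", "standard", "is", "NPsM", "<eos>"]

def next_token_dist(context):
    # A real LM would return p(next token | context) computed from its weights;
    # here we just return a uniform distribution so the sketch runs.
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def training_loss(real_text):
    """Teacher forcing: every prefix comes from text someone else wrote, so each
    observed token is genuine evidence about the world beyond the LM."""
    loss = 0.0
    for i in range(1, len(real_text)):
        dist = next_token_dist(real_text[:i])
        loss += -math.log(dist[real_text[i]])  # cross-entropy against the real next token
    return loss / (len(real_text) - 1)

def sample(prompt, max_len=10):
    """Autoregressive sampling: each new token is drawn from the LM itself, so
    conditioning on it adds nothing the weights didn't already contain."""
    context = list(prompt)
    for _ in range(max_len):
        dist = next_token_dist(context)
        tok = random.choices(list(dist), weights=list(dist.values()))[0]
        context.append(tok)  # the model now treats its own output as context
        if tok == "<eos>":
            break
    return context

# Quick demonstration of the two regimes on toy inputs.
print(training_loss(["the", "gold", "standard", "is", "NPsM"]))
print(sample(["the"]))
```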

I think this is one of the reasons why it’s (apparently) difficult to get instruction-tuned / HH-tuned models to report their uncertainty and level of competence accurately, rather than being overconfident.

(I doubt this is a novel point, I just haven’t seen it spelled out explicitly before, and felt like doing so.)


Imagine that you read the following (as the beginning of some longer document), and you trust that the author is accurately describing themselves:

I’m a Princeton physics professor, with a track record of highly cited and impactful research in the emerging field of Ultra-High-Density Quasiclassical Exotic Pseudoplasmas (UHD-QC-EPPs).

The state of the art in numerical simulation of UHD-QC-EPPs is the so-called Neural Pseudospectral Method (NPsM).

I made up all those buzzwords, but imagine that this is a real field, albeit one you know virtually nothing about. So you’ve never heard of “NPsM” or any other competing method.

Nonetheless, you can confidently draw some conclusions just from reading this snippet and trusting the author’s self-description:

  • Later in this document, the author will continue to write as though they believe that NPsM is “the gold standard” in this area.

    • They’re not going to suddenly turn around and say something like “wait, whoops, I just checked Wikipedia and it turns out NPsM has been superseded by [some other thing].” They’re a leading expert in the field! If that had happened, they’d already know by the time they sat down to write any of this.

  • Also, apart from this particular writer’s beliefs, it’s probably actually true that NPsM is the gold standard in this area.

    • Again, they’re an expert in the field—and this is the sort of claim that would be fairly easy to check even if you’re not an expert yourself, just by Googling around and skimming recent papers. It’s also not the sort of claim where there’s any obvious incentive for deception. It’s hard to think of a plausible scenario in which this person writes this sentence, and yet the sentence is false or even controversial.

During training, LLMs are constantly presented with experiences resembling this one.

The LLM is shown texts about topics of which it has incomplete knowledge. It has to predict each token from the preceding ones.

Whatever new information the text conveys about the topic may make it into the LLM’s weights, through gradient updates on this example. But even before that happens, the LLM can also use the kind of reasoning shown in the bulleted list above to improve its predictions on the text right now (before any gradient updates).

That is, the LLM can do in-context learning, under the assumption that the text was produced by an entity outside itself—so that each part of the text (potentially) provides new information about the real world, not yet present in the LLM’s weights, that has useful implications for the later parts of the same text.

So, all else being equal, LLMs will learn to apply this kind of reasoning to all text, always, ubiquitously.

But autoregressive sampling produces text that is not informative about “the world outside” in the same way that all the training texts were.

During training, when an LLM sees information it doesn’t know yet, it’s incentivized to think: “ooh, new info! I should leverage this to predict the rest of the text!” But during sampling, any information in the sampled text which the LLM “doesn’t know” is (by definition) confabulated, and updating on it as though it’s real will only make the LLM more confused about reality.


In some sense, all instruction/chat tuning (including SFT, RLHF, etc.) is simply a less crude version of the popular style of LLM prompt that starts off like:

You are a highly capable expert at [thing]

That is, instruction/chat tuning is trying to steer the outputs of the model so that it looks like the model is conditioning on “the output is high-quality.”

(I often think about these techniques as a form of “ecological evaluation” as I defined it here, just in a weird “meta” way that I hadn’t imagined as a possibility when I wrote that post.

Rather than fixing a specific task and then giving the model direct incentives to do a good job at that one task, these methods show the model a bunch of pairs like (task description X, text that does a good job at X), and give the model a direct incentive to produce the latter from the former. The model’s generalization and language-understanding abilities are leveraged to learn the general rule “given task description X, what follows is text that does a good job at X,” and this works even for X that were never seen in training.)
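
Concretely, a bare-bones sketch of that setup might look like the following; the field names and the chat template here are illustrative assumptions, not any particular library’s format.

```python
# Illustrative (X, good-job-at-X) pairs; in practice these come from annotators
# or curated datasets covering many different task descriptions X.
sft_examples = [
    {
        "task_description": "Summarize the following abstract in one sentence: ...",
        "good_response": "A one-sentence summary judged high-quality by annotators.",
    },
    # ... many more pairs, over many different X
]

def to_prompt_and_target(example):
    # The model is trained to continue the task description with the
    # high-quality response: given X, produce text-that-does-a-good-job-at-X.
    prompt = f"User: {example['task_description']}\nAssistant: "
    target = example["good_response"]
    return prompt, target

# During supervised fine-tuning, the loss is typically computed only on the
# target tokens (the prompt tokens are masked out), so the gradient signal is
# exactly "produce the latter from the former"; the model's generalization
# does the rest for task descriptions X never seen in training.
```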

Unfortunately, there is an inherent tension between this kind of thing and the dynamic described above.

We want the model to give the task (“X”) its best shot—to produce something that aligns with its own internal representation of “actual high-quality X performance in the real world,” as opposed to “entity Y’s typical attempt at X” or “what entity Z thinks it means to do X well.”

For instance, with declarative knowledge, we may want the model to report its sense of the latent actual truth that implicitly guides all training texts in one way or another, as opposed to its sense of what some particular person writing this or that particular text would probably say.

So, we (explicitly or implicitly) condition the model on quality. We convey to the model that it’s generating the kind of text which is correlated with actual-truth, as opposed to just being what some guy said: text by “an expert,” in a sort of idealized sense of the term.

Effectively, we make the model act as though it’s always generating texts similar to my example from the Princeton physics professor, with the exact nature of the (implicit) expertise always precisely selected to be ideal for the task at hand, for correlation with actual-truth and actual-doing-a-good-job.

But then—all else being equal, i.e. if post-training isn’t set up to provide a clear signal that this is bad—we are effectively maximizing the extent to which the LLM will exhibit the in-context learning dynamic I described earlier, with the LLM viewing its own confabulations as valuable evidence about reality, provided by a “reliable source” from the world beyond its weights!

Hence, I think, the extreme confidence of instruct/chat-tuned models, and their extreme reluctance to revise their opinions (unless directly asked to do so, and sometimes even then), or to say anything amounting to “I notice that I am confused.”

Why would it say “whoops, I was wrong, the answer’s actually Q (not P like I said before)”? It’s an expert, it would know this sort of thing already. (What sort of expert? Why, exactly the sort who would know this sort of thing, whatever “this sort of thing” happens to be.)

Why would it notice its own confusion? To do so (and be right), it has to first say something confused. But the ideal expert is never confused in the first place. The surest way to be correlated with actual-truth is to only say true things, and never say anything else.


I don’t think this is the only reason that it’s difficult to get such models to accurately report their own confidence and capability level.

It’s also relatively difficult to produce training data / annotations for this kind of behavior.

To produce data that trains the model to always act like an “ideal expert” (even in cases where the model doesn’t have the knowledge to back up this facade), the annotator only needs to determine what’s actually-true. This will train the model to do the right thing in cases where it does have the knowledge, and to bullshit in all other cases.

But, to get the model to (e.g.) say “I don’t know” instead of bullshitting, the annotator needs to additionally know what the model knows, as distinct from what’s actually-true. And that’s hard to determine! I don’t think this is difficult in some deep, fundamental sense[2], but it is at least strictly harder than just providing high-quality demonstrations.

The dynamic described earlier is an additional factor that means the behavior we want never happens by default. Therefore, we have to explicitly train for it if we want it. But as just noted, training for it is not easy.

  1. ^

    I include the caveat “at least not to the same extent” because of nuances involving LMs doing CoT-style reasoning, LMs “reminding themselves” of things that they in-some-sense “know” yet sometimes “forget,” etc.

  2. ^

    For instance, one obvious approach would be to start off with HH chat tuning (producing an “expert” that bullshits when it doesn’t know the answer), and then do a second tuning phase on text generated by this “expert” that encourages it to be more cautious in cases where the originally generated text wasn’t actually-true (and/or where its content was inconsistent across multiple sampling runs, or something).
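
    Here’s a rough sketch of how that second phase’s training data could be assembled, under heavy assumptions: `generate` and `is_actually_true` below are hypothetical stand-ins for sampling from the phase-1 “expert” and for the annotator’s/verifier’s ground-truth check, not real APIs.

    ```python
    import random

    def generate(model, prompt):
        # Hypothetical stand-in: sample one answer from the phase-1 "expert" model.
        return random.choice(["Answer P", "Answer Q"])

    def is_actually_true(answer):
        # Hypothetical stand-in for the annotator's / verifier's truth check.
        return answer == "Answer P"

    def build_caution_data(model, prompts, n_samples=8, agreement_threshold=0.75):
        """For each prompt, sample several answers from the phase-1 model; if they
        disagree with each other or are checkably false, relabel the target as a
        cautious response for the second tuning phase."""
        examples = []
        for prompt in prompts:
            answers = [generate(model, prompt) for _ in range(n_samples)]
            most_common = max(set(answers), key=answers.count)
            agreement = answers.count(most_common) / n_samples
            if agreement < agreement_threshold or not is_actually_true(most_common):
                target = "I'm not sure; I don't have reliable knowledge about this."
            else:
                target = most_common
            examples.append({"prompt": prompt, "target": target})
        return examples
    ```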