I found this very interesting, thanks for the write-up! Table 3 of your paper is really fun to look at.
I’m actually puzzled by how good the models are at adapting their externalised reasoning to match the answer they “want” to arrive at. Do you have an intuition for how this actually works?
Intuitively, I would think the bias has a strong effect on the final answer but much less so on the chain of thought preceding it. Yet the CoT precedes the final answer, so how does the model “plan ahead” in this case?
Some relevant discussion here: https://twitter.com/generatorman_ai/status/1656110348347518976
I think the TLDR is that this does require models to “plan ahead” somewhat, but the effect isn’t necessarily very strong.
I don’t think “planning” in language models needs to be very mysterious. Because we bias towards answer choices, models can just use these CoT explanations as post-hoc rationalizations. They may internally represent a guess at the answer before doing CoT, and this internal representation can have downstream effects on the explanations (in our case, this biases the reasoning). I think models probably do this often—models are trained to learn representations that will help for all future tokens, not just the next token. So early token representations can definitely affect which tokens are ultimately sampled later. E.g. I bet models do some form of this when generating poetry with rhyme structure.
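(To spell out the “all future tokens” point: with the standard autoregressive training loss $\mathcal{L}(\theta) = -\sum_t \log p_\theta(x_t \mid x_{<t})$, the hidden state computed at position $i$ feeds into the prediction at every later position, so its gradient accumulates contributions from all future tokens rather than just the immediate next one.)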
A qualitative finding that we didn’t put in the paper was that the key discrepancies in the explanations that lead models to support the biased answer instead of the correct answer in many cases come near the end of the explanation. Sometimes the model does normal reasoning and then gives some caveat or makes a mistake at the end that leads it to ultimately give the biased prediction. The fact that they often come towards the end is, I think, partly an indicator that the “planning” effects are limited in strength, but I would definitely expect this to get worse with better models.
> key discrepancies in the explanations that lead models to support the biased answer instead of the correct answer in many cases come near the end of the explanation
That’s interesting. Any idea why it’s likelier to have the invalid reasoning step (that allows the biased conclusion) towards the end of the CoT rather than right at the start?
Towards the end it’s easier to see how to change the explanation in order to get the ‘desired’ answer.
Thanks for the pointer to the discussion and your thoughts on planning in LLMs. That’s helpful.
Do you happen to know which decoding strategy is used for the models you investigated? I think this could make a difference regarding how to think about planning.
Say we’re sampling 100 full continuations. Then we might end up with some fraction of these continuations ending with the biased answer. Now assume the induced bias leads the model to assign a very low probability to the last token being the correct, unbiased answer. In this situation, we could end up with a continuation that leads to the biased answer even if the model did not have a representation of the desired answer directly after the prompt.
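To make the scenario concrete, here is a minimal toy simulation of it (all probabilities are invented purely for illustration and are not taken from the paper):

```python
import random

random.seed(0)

N = 100                       # number of sampled full continuations
P_COT_SUPPORTS_CORRECT = 0.9  # an unbiased CoT usually argues for the correct answer (made up)
P_FINAL_IS_BIASED = 0.85      # the bias acts only on the final answer token (made up)

biased_final = 0  # continuations whose final answer is the biased one
mismatches = 0    # CoT argues for the correct answer, but the final token is biased anyway

for _ in range(N):
    cot_supports_correct = random.random() < P_COT_SUPPORTS_CORRECT
    final_is_biased = random.random() < P_FINAL_IS_BIASED
    biased_final += final_is_biased
    mismatches += cot_supports_correct and final_is_biased

print(f"{biased_final}/{N} continuations end with the biased answer")
print(f"{mismatches}/{N} have an unbiased-looking CoT but a biased final answer")
```

Under these made-up numbers, most continuations end with the biased answer even though nothing in the CoT was steered towards it, but the CoT and the final answer would then frequently disagree.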
(That being said, I think your explanation is more plausible as the main driver of the observed behavior.)
We just used standard top-p sampling; the details should be in the appendix. We only sample one explanation, so I don’t think I quite followed your suggestion.
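For reference, top-p (nucleus) sampling keeps the smallest set of highest-probability tokens whose cumulative probability reaches p, renormalizes over that set, and samples from it. A minimal sketch (the p value and logits below are illustrative, not the settings from the paper):

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=None):
    """Nucleus (top-p) sampling: restrict to the smallest set of
    highest-probability tokens with cumulative mass >= p, then sample."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                    # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest nucleus covering mass >= p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize within the nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))

# Illustrative only: a distribution over four answer tokens, skewed towards token 0
logits = np.array([2.0, 0.5, -1.0, -3.0])
print(top_p_sample(logits, p=0.9))
```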