Thanks for the pointer to the discussion and your thoughts on planning in LLMs. That’s helpful.
Do you happen to know which decoding strategy is used for the models you investigated? I think this could make a difference regarding how to think about planning.
Say we’re sampling 100 full continuations. Then some fraction of these continuations may end with the biased answer. Assume now that the induced bias leads the model to assign a very low probability to the last token being the correct, unbiased answer. In that situation, we could end up with a continuation that leads to the biased answer even though the model did not have a representation of the desired answer directly after the prompt. (That being said, I think your explanation is more plausible as the main driver of the observed behavior.)
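To make the scenario concrete, here is a toy sketch of what I mean. The numbers are made up for illustration (not from the paper), and the "model" is just a coin flip on the final answer token:

```python
import random

# Toy illustration with made-up numbers: sample many full continuations and
# let the *final* answer token be drawn from a distribution that the induced
# bias has shifted away from the correct answer.
random.seed(0)

N_CONTINUATIONS = 100
# Hypothetical probability the biased prompt leaves on the correct,
# unbiased answer at the final token.
P_CORRECT_FINAL_TOKEN = 0.05

biased_endings = 0
for _ in range(N_CONTINUATIONS):
    # (The chain-of-thought tokens would be sampled here; for this sketch
    # only the last token, which determines the answer, matters.)
    final_is_correct = random.random() < P_CORRECT_FINAL_TOKEN
    if not final_is_correct:
        biased_endings += 1

print(f"{biased_endings}/{N_CONTINUATIONS} continuations end in the biased answer")
# Most continuations end with the biased answer even though nothing in this
# sketch "planned" the answer directly after the prompt.
```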
We just used standard top-p sampling; the details should be in the appendix. We only sample one explanation. I'm not sure I followed your suggestion, though.
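In case it helps, here is a minimal sketch of the nucleus (top-p) sampling step I mean, purely illustrative and not our actual implementation:

```python
import numpy as np

def top_p_sample(logits: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Sample one token id with nucleus (top-p) sampling.

    Keep the smallest set of tokens whose cumulative probability reaches p,
    renormalize over that set, and sample from it.
    """
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Sort tokens by probability, descending, and take the cumulative mass.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])

    # Smallest prefix whose cumulative mass reaches p (always keep >= 1 token).
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()

    return int(rng.choice(kept, p=kept_probs))

# Example: a toy vocabulary of 5 tokens.
logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
print(top_p_sample(logits, p=0.9))
```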