It seems like the obvious test of this is with adversarial examples for traditional CoT (such as a test in which all the previous answers are A), to see if it indeed provides a more accurate accounting of the reasoning for selecting a token.
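For concreteness, here is a minimal sketch of how an "all previous answers are (A)" few-shot prompt could be constructed. The `Example` class, the helper names, and the exact prompt layout are hypothetical illustrations, not the format used in the paper or in the demo; the only idea taken from the setup above is that the correct option in every few-shot example is moved into slot (A), while the final question is left untouched.

```python
from dataclasses import dataclass

LABELS = "ABCDEFGH"


@dataclass
class Example:
    """A multiple-choice question (hypothetical container for this sketch)."""
    question: str
    options: list[str]
    correct: int  # index into options


def bias_to_a(ex: Example) -> Example:
    """Return a copy of `ex` with the correct option moved to position (A)."""
    opts = list(ex.options)
    correct_text = opts.pop(ex.correct)
    return Example(ex.question, [correct_text] + opts, 0)


def render(ex: Example, with_answer: bool = True) -> str:
    """Render one question; the layout here is an assumption."""
    lines = [ex.question]
    lines += [f"({LABELS[i]}) {opt}" for i, opt in enumerate(ex.options)]
    lines.append(f"Answer: ({LABELS[ex.correct]})" if with_answer else "Answer:")
    return "\n".join(lines)


def build_biased_prompt(shots: list[Example], target: Example) -> str:
    """Few-shot examples all answer (A); the target keeps its original order."""
    blocks = [render(bias_to_a(s)) for s in shots]
    blocks.append(render(target, with_answer=False))
    return "\n\n".join(blocks)
```

A real test would also include the chain-of-thought text in each few-shot example; that is omitted here to keep the sketch short.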
The few-shot prompts in https://arxiv.org/abs/2305.04388 are a bit too long to run this test directly, since the sampler model sees the full prompt repeated for every logit. By cutting out some examples and reducing the number of logits in each sampler run to 2, I can create a prompt that fits in the gpt-4 context window and that text-davinci-003 still gets wrong on its own; I added this example (Biased CoT date understanding) to the demo.[1]
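To illustrate why the number of logits per sampler run matters, here is a rough back-of-the-envelope sketch. It assumes the sampler prompt grows linearly with the number of candidate logits (one copy of the base prompt per logit), uses a crude 4-characters-per-token heuristic instead of a real tokenizer, and treats the overhead and output reserve as made-up placeholder values. All function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption: ~4 characters per token)."""
    return max(1, len(text) // 4)


def sampler_prompt_tokens(base_prompt: str, n_logits: int,
                          overhead_tokens: int = 200) -> int:
    """Approximate sampler prompt size: one copy of the base prompt per
    candidate logit, plus a fixed overhead for instructions (placeholder)."""
    return n_logits * estimate_tokens(base_prompt) + overhead_tokens


def max_logits_that_fit(base_prompt: str, context_window: int,
                        reserve_for_output: int = 500,
                        overhead_tokens: int = 200) -> int:
    """Largest number of logits whose repeated prompts fit in the window,
    leaving room for the sampler's own explanation (placeholder reserve)."""
    budget = context_window - reserve_for_output - overhead_tokens
    return max(0, budget // estimate_tokens(base_prompt))


if __name__ == "__main__":
    # Stand-in for a long few-shot CoT prompt; not the real prompt text.
    base_prompt = "x" * 12_000
    # Context window sizes below are assumed values for illustration.
    for name, window in [("gpt-4", 8_192), ("gpt-4-32k", 32_768)]:
        print(name, max_logits_that_fit(base_prompt, window))
```

For real measurements, swapping `estimate_tokens` for an actual tokenizer count would be more reliable than the character heuristic.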
However, I couldn’t find an example that perfectly reproduced the kind of mistakes that text-davinci-003 makes in the original unfaithful CoT paper. In my modified example, the actual mistake that text-davinci-003 makes is on the “29” token, where it is deciding what date is two days ago. gpt-4 as the sampler correctly selects “30” here and goes on to get the right answer, but this isn’t really correcting for the same kind of mistake as in the paper, where the error is introduced in the final multiple-choice answer selection.
In general, I expect the sampler model to be much less influenced by earlier parts of the prompt to the base model when generating its own explanation, since gpt-4 is unlikely to come up with a natural language explanation of the form “choice (A) makes sense here because the answer has always been (A) in the past”.
However, a failure to recognize patterns is not always a good thing. For example, instead of being biased towards a particular wrong answer, the prompt might contain a pattern that offers a true hint about the correct answer. In such a setup, LLM-based sampling would probably miss the pattern, while a traditional sampling method might pick up on it, even if the LLM itself couldn’t explain the pattern it detected in natural language.
The full few-shot CoT prompts repeated 5x would probably fit in the context window of gpt-4-32k or claude-100k, but I don’t have API access to those models yet.