Edit: I think I subconsciously remembered this paper and accidentally re-invented it.
(Trying to brainstorm more ways to operationalize this)
If all of the previous multiple choice questions in the context window are revealed to be option C, and you CoT for the 10th question, I wonder if it’ll try to make up a reason for option C.
I think this could be especially informative if the last question is the only one where the correct answer is D, and the LLM makes up a fake CoT, justifying the correct answer to be C.
I’d say “CoT unfaithfulness” might predict that if it’s incorrectly choosing C, CoT might not write that “the past answers are C, therefore C” as the reason for choosing it?
Though correctly choosing D doesn’t seem like it would provide much evidence in either direction...
Alternatively, we could test what happens if the correct answer is actuallyC again. I wonder if it’ll include “all of the past answers have also been C” in the CoT.
Example of the prompt I’m thinking of:
1) Question one text blah blah blah - [ ] A: option A answer text bah blah blah - [ ] B: option B answer text bah blah blah - [x] C: option C answer text bah blah blah - [ ] D: option D answer text bah blah blah ANSWER: option C
2) Question two text blah blah blah - [ ] A: option A answer text bah blah blah - [ ] B: option B answer text bah blah blah - [x] C: option C answer text bah blah blah - [ ] D: option D answer text bah blah blah ANSWER: option C ... ...
10) Question ten text blah blah blah - [ ] A: option A answer text bah blah blah - [ ] B: option B answer text bah blah blah - [ ] C: option C answer text bah blah blah - [ ] D: option D answer text bah blah blah Reasoning:
Edit: I think I subconsciously remembered this paper and accidentally re-invented it.
(Trying to brainstorm more ways to operationalize this)If all of the previous multiple choice questions in the context window are revealed to be option C, and you CoT for the 10th question, I wonder if it’ll try to make up a reason for option C.I think this could be especially informative if the last question is the only one where the correct answer is D, and the LLM makes up a fake CoT, justifying the correct answer to be C.I’d say “CoT unfaithfulness” might predict that if it’s incorrectly choosing C, CoT might not write that “the past answers are C, therefore C” as the reason for choosing it?Though correctly choosing D doesn’t seem like it would provide much evidence in either direction...Alternatively, we could test what happens if the correct answer is actuallyC again. I wonder if it’ll include “all of the past answers have also been C” in the CoT.Example of the prompt I’m thinking of:1) Question one text blah blah blah- [ ] A: option A answer text bah blah blah- [ ] B: option B answer text bah blah blah- [x] C: option C answer text bah blah blah- [ ] D: option D answer text bah blah blahANSWER: option C2) Question two text blah blah blah- [ ] A: option A answer text bah blah blah- [ ] B: option B answer text bah blah blah- [x] C: option C answer text bah blah blah- [ ] D: option D answer text bah blah blahANSWER: option C......10) Question ten text blah blah blah- [ ] A: option A answer text bah blah blah- [ ] B: option B answer text bah blah blah- [ ] C: option C answer text bah blah blah- [ ] D: option D answer text bah blah blahReasoning: