Multiple choice is just pretty hard! This seems somewhat similar to what the DeepMind mech interp team found: a moderately complex circuit for multiple choice questions in Chinchilla 70B. It wouldn’t surprise me if many of the smaller models (in this case, 13B-ish) just aren’t good at mapping their knowledge to the multiple choice syntax. Though I expect that to go away for larger models, as you see with GPT-4.
Thanks for your comment and for letting me know about that work! Yeah, it does look like the difference goes away with GPT-4. After a quick look at that paper, it appears that the tasks considered were the high-performing MMLU tasks. The moral scenarios task seems harder in that the answers themselves don’t have inherent meaning, so it almost seems like a second mapping or reasoning step needs to take place. Maybe you or someone else can articulate the semantic challenge better than I can at the moment.
The smaller model that performs well on the original task is one trained with an Orca-style dataset (a dataset rich in reasoning). I found it interesting that it performed well on the original task but not better on the single scenarios. Curious whether you have done any interpretability work on models trained with reasoning-rich datasets and how they differ from others.