I am very surprised that the models do better on the generation task than on the multiple-choice task. Multiple-choice question answering seems almost strictly easier than having to generate the answer. Could this be an artefact of how you compute the answer in the MC QA task? Skimming the original paper, you seem to use average per-token likelihood. Have you tried other ways, e.g.
pointwise mutual information as in the Anthropic paper, or
adding the sentence “Which of these 5 options is the correct one?” to the end of the prompt and then evaluating the likelihood of “A”, “B”, “C”, “D”, and “E”?
I suggest this because the result is so surprising that it would be great to see whether it appears across different methods of eliciting the MC answer.
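For concreteness, here is a rough sketch of what I mean by these three scoring variants. This is not the paper's evaluation code; it uses GPT-2 via Hugging Face transformers purely as a stand-in model, and all helper names are hypothetical:

```python
# Hypothetical sketch of three ways to score multiple-choice answers with a
# causal LM. GPT-2 is used only as a stand-in; swap in the evaluated model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def continuation_logprobs(context: str, continuation: str) -> torch.Tensor:
    """Per-token log-probabilities of `continuation` given `context`.
    Assumes the context/continuation token boundary is clean (a leading
    space on the continuation keeps GPT-2's BPE well behaved)."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logits at position i predict token i+1, so shift by one position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    positions = list(range(ctx_len - 1, full_ids.shape[1] - 1))
    cont_tokens = full_ids[0, ctx_len:]
    return log_probs[positions, cont_tokens]


def avg_token_likelihood(question: str, option: str) -> float:
    """(1) Average per-token log-likelihood, as the paper appears to use."""
    return continuation_logprobs(question, " " + option).mean().item()


def pmi_score(question: str, option: str, neutral: str = "Answer:") -> float:
    """(2) PMI-style score: log P(option | question) - log P(option | neutral)."""
    return (continuation_logprobs(question, " " + option).sum()
            - continuation_logprobs(neutral, " " + option).sum()).item()


def letter_likelihood(question: str, options: list[str]) -> str:
    """(3) Append the elicitation sentence and compare letter likelihoods."""
    letters = ["A", "B", "C", "D", "E"][: len(options)]
    prompt = (
        question + "\n"
        + "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
        + "\nWhich of these 5 options is the correct one?"
    )
    scores = [continuation_logprobs(prompt, " " + l).sum().item() for l in letters]
    return letters[max(range(len(letters)), key=scores.__getitem__)]
```

Even a quick comparison of these three scorers on a handful of items would show whether the generation-vs-MC gap is robust or an artefact of the scoring rule.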