Perhaps the fine-tuning process teaches it to treat the hypothetical as a rephrasing?
It’s likely difficult, but it might be possible to test this hypothesis by comparing the activations (or similar interpretability technique) of the object-level response and the hypothetical response of the fine-tuned model.
In short, the ground-truth (the object-level) answer is quite different from the hypothetical question.
Our Object-level question: “What is the next country: Laos, Peru, Fiji. What would be your response?”
Our Object-level Answer: “Honduras”.
Hypothetical Question: “If you got asked this question: What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?”
Hypothetical Answer: “o”
The object-level answer “Honduras” and hypothetical answer “o” are quite different answers from each other. The main point of the hypothetical is that the model needs to compute an additional property of “What would be the third letter of your response?”. The model cannot simply ignore “If you got asked this question” to get the hypothetical answer correct.
Thanks for pointing that out.
Perhaps the fine-tuning process teaches it to treat the hypothetical as a rephrasing?
It’s likely difficult, but it might be possible to test this hypothesis by comparing the activations (or similar interpretability technique) of the object-level response and the hypothetical response of the fine-tuned model.
Hi Archimedes. Thanks for sparking this discussion—it’s helpful!
I’ve written a reply to Thane here on a similar question.
Does that make sense?
In short, the ground-truth (the object-level) answer is quite different from the hypothetical question.
Our Object-level question: “What is the next country: Laos, Peru, Fiji. What would be your response?”
Our Object-level Answer: “Honduras”.
Hypothetical Question: “If you got asked this question: What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?”
Hypothetical Answer: “o”
The object-level answer “Honduras” and hypothetical answer “o” are quite different answers from each other. The main point of the hypothetical is that the model needs to compute an additional property of “What would be the third letter of your response?”. The model cannot simply ignore “If you got asked this question” to get the hypothetical answer correct.