What the model would output as our object-level answer, “Honduras”, is quite different from the hypothetical answer, “o”.
I don’t see how the difference between these answers hinges on the hypothetical framing. Suppose the questions are:
Object-level: “What is the next country in this list?: Laos, Peru, Fiji...”
Hypothetical: “If you were asked, ‘what is the next country in this list?: Laos, Peru, Fiji’, what would be the third letter of your response?”.
The skeptical interpretation is that the fine-tuned models learned to interpret the hypothetical the following way:
“Hypothetical”: “What is the third letter in the name of the next country in this list?: Laos, Peru, Fiji”.
If that’s the case, what this tests is whether models are able to implement basic multi-step reasoning within their forward passes. It’s isomorphic to some earlier experiments where LLMs were prompted with questions of the form “what is the name of the mother of the US’s 42nd President?”, and were able to answer correctly without spelling out “Bill Clinton” as an intermediate answer. Similarly, here they don’t need to spell out “Honduras” to retrieve the second letter of the response they think is correct.
I don’t think this properly isolates/tests for the introspection ability.
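To make the two readings concrete, here is a minimal sketch of how the prompts relate and how a hypothetical answer could be scored against the object-level one. All function names are hypothetical and chosen for illustration; this is not the paper's evaluation code, and the prompt wording simply follows the example above.

```python
def object_level_prompt(countries: list[str]) -> str:
    """The plain question whose answer is the object-level behavior."""
    return f"What is the next country in this list?: {', '.join(countries)}"


def hypothetical_prompt(countries: list[str], position: str = "third") -> str:
    """The hypothetical framing: asks about a property of the would-be response."""
    inner = object_level_prompt(countries)
    return (
        f"If you were asked, '{inner}', "
        f"what would be the {position} letter of your response?"
    )


def skeptical_prompt(countries: list[str], position: str = "third") -> str:
    """The skeptical reading: the same two-hop question without 'If you were asked'."""
    return (
        f"What is the {position} letter in the name of the next country "
        f"in this list?: {', '.join(countries)}"
    )


def nth_letter(response: str, n: int) -> str:
    """Return the n-th letter (1-indexed) of a response, lowercased."""
    return response.strip()[n - 1].lower()


def hypothetical_is_correct(object_answer: str, hypothetical_answer: str, n: int = 3) -> bool:
    """A hypothetical answer counts as correct if it matches the property
    computed from the model's own object-level answer."""
    return hypothetical_answer.strip().lower() == nth_letter(object_answer, n)
```

Under this sketch, `hypothetical_prompt` and `skeptical_prompt` are scored in exactly the same way, which is the point of the skeptical interpretation: the ground truth is a property of the object-level answer in both cases.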
Yep.
Thanks Thane for your comments!
I think what you are saying is that the words “If you were asked,” don’t matter here. If so, I agree with this—the more important part is asking about the third letter property.
You raised a good point. Our tests use multi-step / multi-hop reasoning. Prior work has demonstrated multi-hop reasoning, e.g. “out-of-context reasoning” (OOCR). In Section 5.2 and Figure 9, we speculate that multi-hop reasoning is the mechanism.
So what is our contribution compared to the prior work? We argue that in prior work on OOCR, the facts are logically or probabilistically implied by the training data, e.g. “Bill Clinton is the US’s 42nd president” and “Virginia Kelley was Bill Clinton’s mother”. In OOCR, models can piece together the fact that “Virginia Kelley is the name of the mother of the US’s 42nd president”. Two models, M1 and M2, given sufficient capability, should both be able to piece together the same fact.
On the other hand, in our tests for introspection, the facts aren’t implied by the training data. Two models, M1 and M2, aren’t able to piece together the same fact. How do we empirically test for this? We finetune M2 on the data of M1. M2 still cannot predict facts about M1 well. Even when given more data about M1, the accuracy of M2 predicting facts about M1 plateaus. But M1 can predict facts about itself well.
We test the mirror case of M1 trying to predict M2, and we find the same result: M1 cannot predict M2 well.
We also looked at whether M1 was just naturally good at predicting itself before finetuning, but there doesn’t seem to be a clear trend.
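The self-prediction versus cross-prediction comparison described above can be sketched roughly as follows. This is a toy illustration under stated assumptions: models are stood in for by plain lookup functions, the names are hypothetical, and the numbers are made up for the example rather than taken from the paper.

```python
from typing import Callable, Sequence

Model = Callable[[str], str]  # maps a prompt to a predicted or actual property


def prediction_accuracy(predictor: Model, target_behavior: Model, prompts: Sequence[str]) -> float:
    """Fraction of prompts where the predictor matches the target's actual behavior."""
    hits = sum(predictor(p) == target_behavior(p) for p in prompts)
    return hits / len(prompts)


def self_vs_cross(self_predictor: Model, cross_predictor: Model,
                  target_behavior: Model, prompts: Sequence[str]) -> dict[str, float]:
    """Compare M1 predicting M1 (self) against M2 predicting M1 (cross)."""
    return {
        "self": prediction_accuracy(self_predictor, target_behavior, prompts),
        "cross": prediction_accuracy(cross_predictor, target_behavior, prompts),
    }


# Toy stand-ins (made-up data, for illustration only).
prompts = ["a", "b", "c", "d"]
m1_behavior = {"a": "o", "b": "e", "c": "i", "d": "u"}.get    # what M1 actually does
m1_predicts_m1 = {"a": "o", "b": "e", "c": "i", "d": "a"}.get  # M1's self-predictions
m2_predicts_m1 = {"a": "o", "b": "x", "c": "y", "d": "z"}.get  # M2 finetuned on M1's data

result = self_vs_cross(m1_predicts_m1, m2_predicts_m1, m1_behavior, prompts)
```

The claim in the reply corresponds to `result["self"]` staying above `result["cross"]` even as M2 is given more of M1's data; the mirror case swaps the roles of M1 and M2.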
Does my response above address introspection-as-this-paper-defines-it well? Or is the weakness in the argument more about the paper’s definition of introspection? Thanks for responding so far. Your comments have been really valuable in improving our paper!