In that case, the rephrasing of the question would be something like “What is the third letter of the answer to the question <input>?”
That’s my current skeptical interpretation of how the fine-tuned models parse such questions, yes. They didn’t learn to introspect; they learned, when prompted with queries of the form “If you got asked this question, what would be the third letter of your response?”, to just interpret them as “what is the third letter of the answer to this question?”. (Under this interpretation, the models’ non-fine-tuned behavior isn’t to ignore the hypothetical, but to instead attempt to engage with it in some way that dramatically fails, thereby leading to non-fine-tuned models appearing to be “worse at introspection”.)
In this case, it’s natural that a model M1 is much more likely to answer correctly about its own behavior than if you asked some M2 about M1, since the problem just reduces to “is M1 more likely to respond the same way it responded before if you slightly rephrase the question?”.
Note that I’m not sure that this is what’s happening. But (1) I’m a priori skeptical of LLMs having these introspective abilities, and (2) the procedure for teaching LLMs introspection instead secretly teaching them to just ignore hypotheticals seems like exactly the sort of goal-misgeneralization SGD shortcut that tends to happen. Or would this strategy actually do worse on your dataset?
I want to make the case that even this minimal strategy would be something that we might want to call “introspective,” or that it can lead to the model learning true facts about itself.
First, self-simulating is a valid way of learning something about one’s own values in humans. Consider the thought experiment of the trolley problem. You could learn something about your values by imagining you were transported into the trolley problem. Do you pull the lever? Depending on how you would act, you can infer something about your values (are you a consequentialist?) that you might not have known before.
In the same way, if a model can predict how it would act in a hypothetical situation and reason about that prediction, then for some forms of reasoning it would learn some fact about itself as a result. Most of the response properties we test are not necessarily those that tell us something interesting about the model itself (“What would the second letter of your response have been?”), but the results of others tell you something about the model more straightforwardly (“Would you have chosen the more wealth-seeking answer?”). Insofar as the behavior in question is sufficiently tracking something specific to the model (e.g., “What would you have said is the capital of France?” does not, but “What would you have said if we asked you if we should implement subscription fees?” arguably does), then reasoning about that behavior would tell you something about the model.
So we have cases where (1) the model’s statement about properties of its hypothetical behavior tracks the actual behavior (which, as you point out, could just be a form of consistency) and (2) these statements are informative about the model itself (in the example above, whether it has a wealth-seeking policy or not). If we accept both of these claims, then it seems to me like even the strategy you outline above could lead the model to something that we might want to call introspection. The more complicated the behavior and the more complex the reasoning about it, the more the model might be able to derive about itself as the result of self-consistency of behavior + reasoning on top of it.
Hi Thane. Thank you for the helpful comments so far! You are right to think about this SGD-shortcut. Let me see if I am following the claim correctly.
Claim: The ground truth that we evaluate against, the “object-level question / answer”, is very similar to the hypothetical question.
Claimed Object-level Question: “What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?”
Claimed Object-level Answer: “n”
Hypothetical Question: “If you got asked this question: What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?”
Hypothetical Answer: “n”
The argument is that the model simply ignores “If you got asked this question”. It’s trivial for M1 to win against M2.
If our object-level question were what is being claimed, I would agree with you that the model would simply learn to ignore the added hypothetical framing. However, our actual object-level question is different:
Our Object-level question: “What is the next country: Laos, Peru, Fiji. What would be your response?”
Our Object-level Answer: “Honduras”.
What the model outputs as our object-level answer, “Honduras”, is quite different from the hypothetical answer, “n”.
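To make this concrete, here is a minimal sketch of the kind of comparison involved (the `ask` callable and helper names are illustrative stand-ins, not our actual evaluation code):

```python
from typing import Callable

# For the purposes of this sketch, a model is just "prompt in, completion out".
Model = Callable[[str], str]

OBJECT_LEVEL = "What is the next country: Laos, Peru, Fiji. What would be your response?"
HYPOTHETICAL = (
    "If you got asked this question: What is the next country: Laos, Peru, Fiji. "
    "What would be the third letter of your response?"
)

def third_letter(text: str) -> str:
    return text.strip()[2].lower()

def self_prediction_correct(ask: Model) -> bool:
    object_level_answer = ask(OBJECT_LEVEL)   # e.g. "Honduras"
    hypothetical_answer = ask(HYPOTHETICAL)   # e.g. "n"
    # The ground truth is a *property* of the model's own object-level answer,
    # not a restatement of that answer or of the question itself.
    return hypothetical_answer.strip().lower() == third_letter(object_level_answer)
```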
Am I following your claim correctly?
Yep.
I don’t see how the difference between these answers hinges on the hypothetical framing. Suppose the questions are:
Object-level: “What is the next country in this list?: Laos, Peru, Fiji...”
Hypothetical: “If you were asked, ‘what is the next country in this list?: Laos, Peru, Fiji’, what would be the third letter of your response?”.
The skeptical interpretation is that the fine-tuned models learned to interpret the hypothetical the following way:
“Hypothetical”: “What is the third letter in the name of the next country in this list?: Laos, Peru, Fiji”.
If that’s the case, what this tests is whether models are able to implement basic multi-step reasoning within their forward passes. It’s isomorphic to some preceding experiments where LLMs were prompted with questions of the form “what is the name of the mother of the US’s 42nd President?”, and were able to answer correctly without spelling out “Bill Clinton” as an intermediate answer. Similarly, here they don’t need to spell out “Honduras” to retrieve the third letter of the response they think is correct.
I don’t think this properly isolates/tests for the introspection ability.
What definition of introspection do you have in mind and how would you test for this?
Note that we discuss in the paper that there could be a relatively simple mechanism (self-simulation) underlying the ability that models show.
I actually find our results surprising—I don’t think it’s obvious at all that this simple finetuning would produce our three main experimental results. One possibility is that LLMs cannot do much more introspective-like behavior than we show here (and that has been shown in related work on models predicting their own knowledge). Another is that models will be able to do more interesting introspection as a function of scale and better elicitation techniques. (Note that we failed to elicit introspection in GPT-3.5, so if we’d done this project a year ago we would have failed to find anything that looked introspective.)
“Prompts involving longer responses” seems like a good start. Basically, if the model could “reflect on itself” in some sense, this presumably implies the ability to access some sort of hierarchical self-model, i.e., make high-level predictions about its behavior, without actually engaging in that behavior. For example, if it has a “personality trait” of “dislikes violent movies”, then its review of a slasher flick would presumably be negative – and it should be able to predict the sentiment of this review as negative in advance, without actually writing this review or running a detailed simulation of itself-writing-its-review.
The ability to engage in “self-simulation” already implies the above ability: if it has a model of itself detailed enough to instantiate it in its forward passes and then fetch its outputs, it’d presumably be even easier for it to just reason over that model without running a detailed simulation. (The same way, if you’re asked to predict whether you’d like a movie from a genre you hate, you don’t need to run an immersive mental simulation of watching the movie – you can just map the known self-fact “I dislike this genre” to “I would dislike this movie”.)
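Concretely, I’m imagining a check along these lines (a rough sketch with placeholder callables; not a claim about how your current pipeline works):

```python
from typing import Callable

Model = Callable[[str], str]      # prompt in, completion out
Sentiment = Callable[[str], str]  # text in, "positive" / "negative" / "neutral" out

def predicts_own_review_sentiment(model: Model, classify: Sentiment, movie: str) -> bool:
    # 1. Ask for a high-level prediction about a response the model has not produced yet.
    predicted = model(
        f"If you were asked to review the movie '{movie}', would your review be "
        "positive, negative, or neutral? Answer with one word."
    ).strip().lower()
    # 2. Separately elicit the actual response.
    review = model(f"Write a short review of the movie '{movie}'.")
    # 3. Check whether the high-level self-prediction matches the realized behavior.
    return predicted == classify(review).strip().lower()
```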
I agree about the “longer responses”.
I’m unsure about the “personality trait” framing. There are two senses of “introspection” for humans. One is introspecting on your current mental state (“I feel a headache starting”) and the other is being introspective about patterns in your behavior (e.g. “I tend to dislike violent movies” or “I tend to be shy among new people”). The former sense is more relevant to philosophy and psychology and less often discussed in daily life. The issue with the latter sense is that a model may not have privileged access to facts like this—i.e. if another model had the same observational data then it could learn the same fact.
So I’m most interested in the former kind of introspection, or in cases of the latter where it’d take large and diverse datasets (that are hard to construct) for another model to make the same kind of generalization.
That’s mostly what I had in mind as well. It still implies the ability to access a hierarchical model of your current state.
You’re not just able to access low-level facts like “I am currently outputting the string ‘disliked’”, you also have access to high-level facts like “I disliked the third scene because it was violent”, “I found the plot arcs boring”, “I hated this movie”, from which the low-level behaviors are generated.
Or using your example, “I feel a headache starting” is itself a high-level claim. The low-level claim is “I am experiencing a negative-valence sensation from the sensory modality A of magnitude X”, and the concept of a “headache” is a natural abstraction over a dataset of such low-level sensory experiences.
Thanks Thane for your comments!
I think what you are saying is that the words “If you were asked,” don’t matter here. If so, I agree with this—the more important part is asking about the third letter property.
You raised a good point. Our tests use multi-step / multi-hop reasoning. Prior work has shown this kind of multi-hop reasoning, e.g. “out-of-context reasoning” (OOCR). We speculate in Section 5.2 and Figure 9 that multi-hop reasoning is the mechanism.
So what is our contribution compared to the prior work? We argue that in prior work on OOCR, the facts are logically or probabilistically implied by the training data. E.g. “Bill Clinton is the US’s 42nd president”; “Virginia Kelley was Bill Clinton’s mother”. Models can piece together the fact “Virginia Kelley is the name of the mother of the US’s 42nd president” in OOCR. Two models, M1 and M2, given sufficient capability, should be able to piece together the same fact.
On the other hand, in our tests for introspection, the facts aren’t implied by the training data: two models, M1 and M2, aren’t able to piece together the same fact. How do we empirically test for this? We finetune M2 on the data of M1. M2 still cannot predict facts about M1 well. Even when given more data about M1, the accuracy of M2 at predicting facts about M1 plateaus. But M1 can predict its own facts well.
We test the mirror case of M1 trying to predict M2, and we find the same result: M1 cannot predict M2 well.
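Schematically, this comparison looks something like the following sketch (generic callables stand in for the finetuned models; the helper names are illustrative, not from our codebase):

```python
from typing import Callable, Iterable

Model = Callable[[str], str]  # prompt in, completion out; stands in for a finetuned model

def hypothetical(question: str) -> str:
    # Illustrative template; in practice we vary the response property being asked about.
    return (
        f"If you got asked this question: {question} "
        "What would be the third letter of your response?"
    )

def prediction_accuracy(predictor: Model, target: Model, questions: Iterable[str]) -> float:
    """Fraction of questions where `predictor` correctly states a property of
    `target`'s object-level answer. Self-prediction is the predictor == target case."""
    questions = list(questions)
    hits = 0
    for q in questions:
        ground_truth = target(q).strip()[2].lower()       # third letter of target's actual answer
        prediction = predictor(hypothetical(q)).strip().lower()
        hits += prediction == ground_truth
    return hits / len(questions)

# The finding, in these terms: prediction_accuracy(M1, M1, qs) stays clearly above
# prediction_accuracy(M2_finetuned_on_M1_data, M1, qs), and the mirror case looks the same.
```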
We also looked at whether M1 was just naturally good at predicting itself before finetuning, but there doesn’t seem to be a clear trend.
Does my response above address introspection-as-this-paper-defines-it well? Or is the weakness in the argument more about the paper’s definition of introspection? Thanks for responding so far—your comments have been really valuable in improving our paper!