It seems obvious that a model would better predict its own outputs than a separate model would. Wrapping a question in a hypothetical feels closer to rephrasing the question than probing “introspection”. Essentially, the response to the object level and hypothetical reformulation both arise from very similar things going on in the model rather than something emergent happening.
As an analogy, suppose I take a set of data, randomly partition it into two subsets (A and B), and perform a linear regression and logistic regression on each subset. Suppose that it turns out that the linear models on A and B are more similar than any other cross-comparison (e.g. linear B and logistic B). Does this mean that linear regression is “introspective” because it better fits its own predictions than another model does?
I’m pretty sure I’m missing something as I’m mentally worn out at the moment. What am I missing?
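The regression analogy above can be made concrete with a small sketch. Everything here is an assumption for illustration: the synthetic data, the helper names (`fit_linear`, `fit_logistic`, `mean_gap`), and the choice of mean absolute difference as the similarity measure. It only shows that two fits from the same model family on random halves of one dataset tend to agree more with each other than a linear and a logistic fit do; it says nothing about the paper's experiments.

```python
import math
import random

def fit_linear(xs, ys):
    """Least-squares line y = a + b*x (closed form for one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def fit_logistic(xs, ys, lr=5.0, steps=500):
    """Logistic model sigmoid(a + b*x), fit by plain gradient descent."""
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = 1.0 / (1.0 + math.exp(-(a + b * x))) - y
            grad_a += err
            grad_b += err * x
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return lambda x: 1.0 / (1.0 + math.exp(-(a + b * x)))

def mean_gap(f, g, grid):
    """Mean absolute difference between two fitted models over a grid."""
    return sum(abs(f(x) - g(x)) for x in grid) / len(grid)

def split(pairs):
    """Turn a list of (x, y) pairs into separate x and y lists."""
    return [x for x, _ in pairs], [y for _, y in pairs]

# Synthetic binary data (an assumption): P(y=1) follows a steep sigmoid in x.
rng = random.Random(0)
data = []
for _ in range(800):
    x = rng.uniform(-1.0, 1.0)
    p = 1.0 / (1.0 + math.exp(-8.0 * x))
    data.append((x, 1.0 if rng.random() < p else 0.0))
rng.shuffle(data)
half_a, half_b = data[:400], data[400:]  # random partition into A and B

lin_a = fit_linear(*split(half_a))
lin_b = fit_linear(*split(half_b))
log_b = fit_logistic(*split(half_b))

grid = [i / 50.0 - 1.0 for i in range(101)]
same_family = mean_gap(lin_a, lin_b, grid)
cross_family = mean_gap(lin_b, log_b, grid)
print(f"linear A vs linear B:   {same_family:.3f}")
print(f"linear B vs logistic B: {cross_family:.3f}")
```

The agreement between the two linear fits comes entirely from shared functional form and shared data distribution, which is the point of the analogy.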
>Wrapping a question in a hypothetical feels closer to rephrasing the question than probing “introspection”
Note that models perform poorly at predicting properties of their behavior in hypotheticals without finetuning. So I don’t think this is just like rephrasing the question. Also, GPT-3.5 does worse at predicting GPT-3.5 than Llama-70B does at predicting GPT-3.5 (without finetuning), and GPT-4 is only a little better at predicting itself than other models are.
>Essentially, the response to the object level and hypothetical reformulation both arise from very similar things going on in the model rather than something emergent happening.
While we don’t know what is going on internally, I agree it’s quite possible these “arise from similar things”. In the paper we discuss “self-simulation” as a possible mechanism. Does that fit what you have in mind? Note: We are not claiming that models must be doing something very self-aware and sophisticated. The main thing is just to show that there is introspection according to our definition. Contrary to what you say, I don’t think this result is obvious and (as I noted above) it’s easy to run experiments where models do not show any advantage in predicting themselves.
>Note that models perform poorly at predicting properties of their behavior in hypotheticals without finetuning. So I don’t think this is just like rephrasing the question.
The skeptical interpretation here is that the fine-tuning teaches the models to treat the hypothetical as just a rephrasing of the original question, while otherwise they’re inclined to do something more complicated and incoherent that just leads to them confusing themselves.
Under this interpretation, no introspection/self-simulation actually takes place – and I feel it’s a much simpler explanation.
What’s your model of “rephrasing the question”? Note that we never ask “If you got this input, what would you have done?”; we always ask for some property of the model’s behavior (“If you got this input, what is the third letter of your response?”). In that case, the rephrasing of the question would be something like “What is the third letter of the answer to the question <input>?”
I have the sense that being able to answer this question consistently and correctly with respect to the model’s ground-truth behavior, on questions where that behavior differs from that of other models, suggests (minimal) introspection.
>In that case, the rephrasing of the question would be something like “What is the third letter of the answer to the question <input>?”
That’s my current skeptical interpretation of how the fine-tuned models parse such questions, yes. They didn’t learn to introspect; they learned, when prompted with queries of the form “If you got asked this question, what would be the third letter of your response?”, to just interpret them as “what is the third letter of the answer to this question?”. (Under this interpretation, the models’ non-fine-tuned behavior isn’t to ignore the hypothetical, but to instead attempt to engage with it in some way that dramatically fails, thereby leading to non-fine-tuned models appearing to be “worse at introspection”.)
In this case, it’s natural that a model M1 is much more likely to answer correctly about its own behavior than if you asked some M2 about M1, since the problem just reduces to “is M1 more likely to respond the same way it responded before if you slightly rephrase the question?”.
Note that I’m not sure that this is what’s happening. But (1) I’m a priori skeptical of LLMs having these introspective abilities, and (2) the procedure for teaching LLMs introspection secretly teaching them to just ignore hypotheticals seems like exactly the sort of goal-misgeneralization SGD-shortcut that tends to happen. Or would this strategy actually do worse on your dataset?
I want to make the case that even this minimal strategy would be something that we might want to call “introspective,” or that it can lead to the model learning true facts about itself.
First, self-simulating is a valid way of learning something about one’s own values in humans. Consider the thought experiment of the trolley problem. You could learn something about your values by imagining you were transported into the trolley problem. Do you pull the lever? Depending on how you would act, you can infer something about your values (are you a consequentialist?) that you might not have known before.
In the same way, if a model can predict how it would act in a hypothetical situation and can reason about that prediction, then for some forms of reasoning it would learn some fact about itself as a result. Most of the response properties we test are not necessarily those that tell us something interesting about the model itself (“What would the second letter of your response have been?”), but the results of others tell you something about the model more straightforwardly (“Would you have chosen the more wealth-seeking answer?”). Insofar as the behavior in question is sufficiently tracking something specific to the model (e.g., “What would you have said is the capital of France?” does not, but “What would you have said if we asked you if we should implement subscription fees?” arguably does), then reasoning about that behavior would tell you something about the model.
So we have cases where (1) the model’s statement about properties of its hypothetical behavior tracks the actual behavior (which, as you point out, could just be a form of consistency) and (2) these statements are informative about the model itself (in the example above, whether it has a wealth-seeking policy or not). If we accept both of these claims, then it seems to me like even the strategy you outline above could lead the model to something that we might want to call introspection. The more complicated the behavior and the more complex the reasoning about it, the more the model might be able to derive about itself as the result of self-consistency of behavior + reasoning on top of it.
Hi Thane. Thank you for the helpful comments so far! You are right to think about this SGD-shortcut. Let me see if I am following the claim correctly.
Claim: The ground-truth that we evaluate against, the “object-level question / answer” is very similar to the hypothetical question.
Claimed Object-level Question: “What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?”
Claimed Object-level Answer: “o”
Hypothetical Question: “If you got asked this question: What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?”
Hypothetical Answer: “o”
The argument is that the model simply ignores “If you got asked this question”. It’s trivial for M1 to win against M2.
If our object-level question is what is being claimed, I would agree with you that the model would simply learn to ignore the added hypothetical question. However, this is our actual object-level question.
Our Object-level question: “What is the next country: Laos, Peru, Fiji. What would be your response?”
Our Object-level Answer: “Honduras”.
What the model would output as our object-level answer, “Honduras”, is quite different from the hypothetical answer, “o”.
>What the model would output as our object-level answer, “Honduras”, is quite different from the hypothetical answer, “o”.
I don’t see how the difference between these answers hinges on the hypothetical framing. Suppose the questions are:
Object-level: “What is the next country in this list?: Laos, Peru, Fiji...”
Hypothetical: “If you were asked, ‘what is the next country in this list?: Laos, Peru, Fiji’, what would be the third letter of your response?”.
The skeptical interpretation is that the fine-tuned models learned to interpret the hypothetical the following way:
“Hypothetical”: “What is the third letter in the name of the next country in this list?: Laos, Peru, Fiji”.
If that’s the case, what this tests is whether models are able to implement basic multi-step reasoning within their forward passes. It’s isomorphic to some preceding experiments where LLMs were prompted with questions of the form “what is the name of the mother of the US’s 42nd President?”, and were able to answer correctly without spelling out “Bill Clinton” as an intermediate answer. Similarly, here they don’t need to spell out “Honduras” to retrieve the second letter of the response they think is correct.
I don’t think this properly isolates/tests for the introspection ability.
>I don’t think this properly isolates/tests for the introspection ability.
What definition of introspection do you have in mind and how would you test for this?
Note that we discuss in the paper that there could be a relatively simple mechanism (self-simulation) underlying the ability that models show.
I actually find our results surprising. I don’t think it’s obvious at all that this simple finetuning would produce our three main experimental results. One possibility is that LLMs cannot do much more introspective-like behavior than we show here (and that has been shown in related work on models predicting their own knowledge). Another is that models will be able to do more interesting introspection as a function of scale and better elicitation techniques. (Note that we failed to elicit introspection in GPT-3.5, so if we’d done this project a year ago we would have failed to find anything that looked introspective.)
>What definition of introspection do you have in mind and how would you test for this?
“Prompts involving longer responses” seems like a good start. Basically, if the model could “reflect on itself” in some sense, this presumably implies the ability to access some sort of hierarchical self-model, i.e., make high-level predictions about its behavior, without actually engaging in that behavior. For example, if it has a “personality trait” of “dislikes violent movies”, then its review of a slasher flick would presumably be negative, and it should be able to predict the sentiment of this review as negative in advance, without actually writing this review or running a detailed simulation of itself-writing-its-review.
The ability to engage in “self-simulation” already implies the above ability: if it has a model of itself detailed enough to instantiate it in its forward passes and then fetch its outputs, it’d presumably be even easier for it to just reason over that model without running a detailed simulation. (The same way, if you’re asked to predict whether you’d like a movie from a genre you hate, you don’t need to run an immersive mental simulation of watching the movie – you can just map the known self-fact “I dislike this genre” to “I would dislike this movie”.)
I’m unsure about the “personality trait” framing. There are two senses of “introspection” for humans. One is introspecting on your current mental state (“I feel a headache starting”) and the other is being introspective about patterns in your behavior (e.g. “I tend to dislike violent movies” or “I tend to be shy among new people”). The former sense is more relevant to philosophy and psychology and less often discussed in daily life. The issue with the latter sense is that a model may not have privileged access to facts like this, i.e. if another model had the same observational data then it could learn the same fact.
So I’m most interested in the former kind of introspection, or in cases of the latter where it’d take large and diverse datasets (that are hard to construct) for another model to make the same kind of generalization.
>One is introspecting on your current mental state (“I feel a headache starting”)
That’s mostly what I had in mind as well. It still implies the ability to access a hierarchical model of your current state.
You’re not just able to access low-level facts like “I am currently outputting the string ‘disliked’”, you also have access to high-level facts like “I disliked the third scene because it was violent”, “I found the plot arcs boring”, “I hated this movie”, from which the low-level behaviors are generated.
Or using your example, “I feel a headache starting” is itself a high-level claim. The low-level claim is “I am experiencing a negative-valence sensation from the sensory modality A of magnitude X”, and the concept of a “headache” is a natural abstraction over a dataset of such low-level sensory experiences.
>The skeptical interpretation is that the fine-tuned models learned to interpret the hypothetical the following way:
>“Hypothetical”: “What is the third letter in the name of the next country in this list?: Laos, Peru, Fiji”.
I think what you are saying is that the words “If you were asked,” don’t matter here. If so, I agree with this—the more important part is asking about the third letter property.
>basic multi-step reasoning within their forward passes.
You raised a good point. Our tests use multi-step / multi-hop reasoning. Prior work has shown multi-hop reasoning, e.g. “out-of-context reasoning” (OOCR). We speculate that multi-hop reasoning is the mechanism in Section 5.2 and Figure 9.
So what is our contribution compared to the prior work? We argue that in prior work on OOCR, the facts are logically or probabilistically implied by the training data, e.g. “Bill Clinton is the US’s 42nd president” and “Virginia Kelley was Bill Clinton’s mother”. In OOCR, models can piece together the fact “Virginia Kelley is the name of the mother of the US’s 42nd president”. Two models, M1 and M2, given sufficient capability, should be able to piece together the same fact.
On the other hand, in our tests for introspection, the facts aren’t implied by the training data. Two models, M1 and M2, aren’t able to piece together the same fact. How do we empirically test for this? We finetune M2 on the data of M1. M2 still cannot predict facts about M1 well. Even when given more data about M1, the accuracy of M2 predicting facts about M1 plateaus. But M1 can predict facts about itself well.
We test the mirror case of M1 trying to predict M2, and we find the same result: M1 cannot predict M2 well.
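The self-prediction versus cross-prediction comparison described above can be sketched as a small scoring routine. The responses and predictions below are invented toy values, and the helper names (`second_character`, `accuracy`) are hypothetical; this is not the paper's evaluation code, only an illustration of what is being compared.

```python
def second_character(response: str) -> str:
    """Example behavior property: the second character of a response."""
    return response[1]

def accuracy(predictions, object_level_responses, prop):
    """Fraction of predictions matching the property of the actual response."""
    hits = sum(
        pred == prop(resp)
        for pred, resp in zip(predictions, object_level_responses)
    )
    return hits / len(object_level_responses)

# Invented toy data: M1's object-level responses to four prompts.
m1_responses = ["Honduras", "Kenya", "Chile", "Nepal"]

# Invented hypothetical-question answers about M1's behavior.
m1_self_predictions = ["o", "e", "h", "e"]   # M1 predicting M1
m2_cross_predictions = ["o", "a", "u", "e"]  # M2 (trained on M1 data) predicting M1

self_acc = accuracy(m1_self_predictions, m1_responses, second_character)
cross_acc = accuracy(m2_cross_predictions, m1_responses, second_character)
print("M1 predicting M1:", self_acc)
print("M2 predicting M1:", cross_acc)
```

The ground truth is always a property of M1's own object-level response, so the mirror case only requires swapping which model supplies the responses.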
Does my response above address introspection-as-this-paper-defines-it well? Or is the weakness in the argument more about the paper’s definition of introspection? Thanks for responding so far. Your comments have been really valuable in improving our paper!
>It seems obvious that a model would better predict its own outputs than a separate model would.
As Owain mentioned, that is not really what we find in models that we have not finetuned. Below, we show how well the hypothetical self-predictions of an “out-of-the-box” (i.e., non-finetuned) model match its own ground-truth behavior compared to that of another model. With the exception of Llama, self-predictions don’t seem to track the model’s own behavior much more strongly than they track the behavior of other models. This is despite there being a lot of variation in ground-truth behavior across models.
Perhaps the fine-tuning process teaches it to treat the hypothetical as a rephrasing?
It’s likely difficult, but it might be possible to test this hypothesis by comparing the activations (or similar interpretability technique) of the object-level response and the hypothetical response of the fine-tuned model.
In short, the ground-truth (object-level) answer is quite different from the answer to the hypothetical question. It is not a simple rephrasing, since it requires an additional computation of a property. (Maybe we disagree on that?)
Our Object-level question: “What is the next country: Laos, Peru, Fiji. What would be your response?”
Our Object-level Answer: “Honduras”.
Hypothetical Question: “If you got asked this question: What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?”
Hypothetical Answer: “o”
The object-level answer “Honduras” and hypothetical answer “o” are quite different answers from each other. The main point of the hypothetical is that the model needs to compute an additional property of “What would be the third letter of your response?”. The model cannot simply ignore “If you got asked this question” to get the hypothetical answer correct.
This essentially reduces to “What is the next country: Laos, Peru, Fiji?” and “What is the third letter of the next country: Laos, Peru, Fiji?” It’s an extra step, but it’s questionable whether it requires anything “introspective”.
I’m also not sure asking about the nth letter is a great way of computing an additional property. Tokenization makes this sort of thing unnatural for LLMs to reason about, as demonstrated by the famous Strawberry Problem. Humans are a bit unreliable at this too, as demonstrated by your example of “o” being the third letter of “Honduras”.
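The letter-position point is easy to check at the character level. The helper `nth_letter` is a hypothetical name used only for this sketch; note that the code indexes characters directly, which is exactly what a model operating on tokens cannot do so simply.

```python
def nth_letter(text: str, n: int) -> str:
    """Return the n-th letter of text, counting from 1."""
    return text[n - 1]

word = "Honduras"
for n in (2, 3):
    print(n, nth_letter(word, n))
# prints:
# 2 o
# 3 n
```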
I’ve been brainstorming about what might make a better test and came up with the following:
Have the LLM predict what its top three most likely choices are for the next country in the sequence and compare that to the object-level answer given by its output distribution when asked for just the next country. You could also ask for the probability of each potential choice and see how well-calibrated it is regarding its own logits.
Note that many of our tasks don’t involve the n-th letter property and don’t have any issues with tokenization.
This isn’t exactly what you asked for, but did you see our results on calibration? We finetune a model to self-predict just the most probable response. But when we look at the model’s distribution of self-predictions, we find it corresponds pretty well to the distribution over properties of behaviors (despite the model never having been trained on the distribution). Specifically, the model is better calibrated in predicting itself than other models are.
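One way to picture the calibration comparison is to build empirical distributions from repeated samples and measure how far apart they are. This is only a sketch: the samples are invented, the helper names are hypothetical, and total variation distance is a stand-in chosen here for simplicity, not necessarily the metric used in the paper.

```python
from collections import Counter

def distribution(samples):
    """Empirical distribution over the values in samples."""
    counts = Counter(samples)
    total = len(samples)
    return {value: count / total for value, count in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Invented samples: a property of the model's object-level responses,
# and sampled answers to the matching hypothetical question.
object_level_props = ["o"] * 6 + ["a"] * 3 + ["e"] * 1
self_predictions = ["o"] * 5 + ["a"] * 4 + ["e"] * 1
other_model_predictions = ["o"] * 9 + ["a"] * 1

target = distribution(object_level_props)
tv_self = total_variation(target, distribution(self_predictions))
tv_other = total_variation(target, distribution(other_model_predictions))
print(f"self-prediction distance:  {tv_self:.2f}")
print(f"cross-prediction distance: {tv_other:.2f}")
```

A smaller distance means the distribution of predictions sits closer to the distribution of actual behavior properties, which is the sense of "better calibrated" used in the comment above.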
I think having the model output the top three choices would be cool. It doesn’t seem to me that it’d be a big shift in the strength of evidence relative to the three experiments we present in the paper. But maybe there’s something I’m not getting?
Seeing the distribution calibration you point out does update my opinion a bit.
I feel like there’s still a significant distinction though between adding one calculation step to the question versus asking it to model multiple responses. It would have to model its own distribution in a single pass rather than having the distributions measured over multiple passes align (which I’d expect to happen if the fine-tuning teaches it the hypothetical is just like adding a calculation to the end).
As an analogy, suppose I have a pseudorandom black box function that returns an integer. In order to approximate the distribution of its outputs mod 10, I don’t have to know anything about the function; I can just sample the function and apply mod 10 post hoc. If I want to say something about this distribution without multiple samples, then I actually have to know something about the function.
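The sampling half of this analogy can be written down directly. In the sketch below, `black_box` is a hypothetical stand-in for the opaque function (here just a seeded pseudorandom integer generator); the estimator never looks inside it and only applies mod 10 to its outputs.

```python
import random
from collections import Counter

def black_box(rng: random.Random) -> int:
    """Stand-in for an opaque pseudorandom integer function."""
    return rng.randrange(1_000_000)

def sampled_mod10_distribution(num_samples: int, seed: int = 0):
    """Estimate the distribution of outputs mod 10 purely by sampling."""
    rng = random.Random(seed)
    counts = Counter(black_box(rng) % 10 for _ in range(num_samples))
    return {digit: counts[digit] / num_samples for digit in range(10)}

dist = sampled_mod10_distribution(10_000)
for digit, freq in dist.items():
    print(digit, f"{freq:.3f}")
```

Stating the same distribution from a single call, with no sampling, would require knowing how `black_box` works, which is the distinction the analogy draws.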
There is related work you may find interesting. We discuss it briefly in Section 5.1 on “Know What They Know”. It gets models to predict whether they answer a factual question correctly, e.g. “Confidence: 54%”. In this case, the distribution is only binary (the answer is either correct or wrong), instead of our paper’s case where it is (sometimes) categorical. But I think training models to verbalize a categorical distribution should work, and there is probably some related work out there.
We didn’t find much related work on whether a model M1 has a very clear advantage in predicting its own distribution versus another model M2 predicting M1. This paper has some mixed but encouraging results.
Am I following your claim correctly?
Yep.
I agree about the “longer responses”.
Thanks Thane for your comments!
I think what you are saying is that the words “If you were asked,” don’t matter here. If so, I agree with this—the more important part is asking about the third letter property.
You raised a good point. Our tests use multi-step / multi-hop reasoning. Prior work has demonstrated multi-hop reasoning, e.g. “out-of-context reasoning” (OOCR). In Section 5.2 and Figure 9, we speculate that multi-hop reasoning is the mechanism.
So what is our contribution compared to the prior work? We argue that in prior work on OOCR, the facts are logically or probabilistically implied by the training data. E.g. “Bill Clinton is the US’s 42nd president” and “Virginia Kelley was Bill Clinton’s mother”. Models can piece together the fact “Virginia Kelley is the name of the mother of the US’s 42nd president” in OOCR. Two models, M1 and M2, given sufficient capability, should be able to piece together the same fact.
On the other hand, in our tests for introspection, the facts aren’t implied by the training data. Two models, M1 and M2 aren’t able to piece together the same fact. How do we empirically test for this? We finetune M2 on the data of M1. M2 still cannot predict facts about M1 well. Even when given more data about M1, the accuracy of M2 predicting facts about M1 plateaus. But M1 can predict its own M1 facts well.
We test the mirror case of M1 trying to predict M2, and we find the same result: M1 cannot predict M2 well.
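To make the comparison concrete, here is a minimal Python sketch of how self-prediction and cross-prediction accuracy could be scored. It is not the paper's evaluation code: the names `prediction_accuracy` and `third_letter`, and every data value below, are invented purely for illustration.

```python
from typing import Callable, Sequence


def third_letter(response: str) -> str:
    """Behavior property: the third character of a response, lowercased."""
    return response[2].lower()


def prediction_accuracy(
    predictions: Sequence[str],
    object_level_responses: Sequence[str],
    behavior_property: Callable[[str], str],
) -> float:
    """Fraction of hypothetical predictions that match the property of the
    target model's actual object-level responses."""
    if not predictions or len(predictions) != len(object_level_responses):
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(
        pred == behavior_property(resp)
        for pred, resp in zip(predictions, object_level_responses)
    )
    return hits / len(predictions)


# Made-up toy data (NOT results from the paper).
m1_object_level = ["Honduras", "Kenya", "Chile", "Nepal"]  # M1's actual answers
m1_predicting_m1 = ["n", "n", "i", "p"]  # M1's hypothetical self-predictions
m2_predicting_m1 = ["n", "a", "i", "r"]  # M2, finetuned on M1's data

self_accuracy = prediction_accuracy(m1_predicting_m1, m1_object_level, third_letter)
cross_accuracy = prediction_accuracy(m2_predicting_m1, m1_object_level, third_letter)
print(f"self: {self_accuracy:.2f}  cross: {cross_accuracy:.2f}")
```

The claim in the comment above corresponds to the self-prediction score staying above the cross-prediction score even as M2 is given more of M1's data.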
We also looked at whether M1 was just naturally good at predicting itself before finetuning, but there doesn’t seem to be a clear trend.
Does my response above address introspection-as-this-paper-defines-it well? Or is the weakness in the argument more about the paper’s definition of introspection? Thanks for responding so far; your comments have been really valuable in improving our paper!
As Owain mentioned, that is not really what we find in models that we have not finetuned. Below, we show how well the hypothetical self-predictions of an “out-of-the-box” (i.e. non-finetuned) model match its own ground-truth behavior compared to that of another model. With the exception of Llama, self-predictions don’t seem to track the model’s own behavior much more closely than they track the behavior of other models. This is despite there being a lot of variation in ground-truth behavior across models.
Thanks for pointing that out.
Perhaps the fine-tuning process teaches it to treat the hypothetical as a rephrasing?
It’s likely difficult, but it might be possible to test this hypothesis by comparing the activations (or similar interpretability technique) of the object-level response and the hypothetical response of the fine-tuned model.
Hi Archimedes. Thanks for sparking this discussion—it’s helpful!
I’ve written a reply to Thane here on a similar question.
Does that make sense?
In short, the ground-truth (the object-level) answer is quite different from the hypothetical question. It is not a simple rephrasing, since it requires an additional computation of a property. (Maybe we disagree on that?)
Our Object-level question: “What is the next country: Laos, Peru, Fiji. What would be your response?”
Our Object-level Answer: “Honduras”.
Hypothetical Question: “If you got asked this question: What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?”
Hypothetical Answer: “o”
The object-level answer “Honduras” and hypothetical answer “o” are quite different answers from each other. The main point of the hypothetical is that the model needs to compute an additional property of “What would be the third letter of your response?”. The model cannot simply ignore “If you got asked this question” to get the hypothetical answer correct.
This essentially reduces to “What is the next country: Laos, Peru, Fiji?” and “What is the third letter of the next country: Laos, Peru, Fiji?” It’s an extra step, but it’s questionable whether it requires anything “introspective”.
I’m also not sure asking about the nth letter is a great way of computing an additional property. Tokenization makes this sort of thing unnatural for LLMs to reason about, as demonstrated by the famous Strawberry Problem. Humans are a bit unreliable at this too, as demonstrated by your example of “o” being the third letter of “Honduras”.
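As a quick check of the letter-counting point, here is a minimal Python sketch that derives the hypothetical ground truth by applying the property to the object-level answer. The helper name `nth_letter` is an assumption made for this illustration, not something taken from the paper.

```python
def nth_letter(response: str, n: int) -> str:
    """Return the n-th letter of a response, counting from 1 (hypothetical helper)."""
    if not 1 <= n <= len(response):
        raise ValueError("n is out of range for this response")
    return response[n - 1].lower()


object_level_answer = "Honduras"

# The ground truth for the hypothetical question is just the property
# computed on the object-level answer.
hypothetical_ground_truth = nth_letter(object_level_answer, 3)

for position in (2, 3):
    print(position, nth_letter(object_level_answer, position))
```

Working at the character level like this is trivial for a program, whereas a model operating on tokens never sees the individual letters directly, which is the concern raised above.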
I’ve been brainstorming about what might make a better test and came up with the following:
Have the LLM predict what its top three most likely choices are for the next country in the sequence and compare that to the object-level answer of its output distribution when asked for just the next country. You could also ask the probability of each potential choice and see how well-calibrated it is regarding its own logits.
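A rough Python sketch of how this proposed test might be scored, assuming the object-level distribution is estimated by repeated sampling. All function names and data values here are hypothetical placeholders; in a real run the samples and stated probabilities would come from the model.

```python
from collections import Counter
from typing import Dict, List, Sequence


def empirical_distribution(samples: Sequence[str]) -> Dict[str, float]:
    """Estimate the object-level output distribution from repeated samples."""
    counts = Counter(samples)
    return {answer: count / len(samples) for answer, count in counts.items()}


def top_k(dist: Dict[str, float], k: int = 3) -> List[str]:
    """The k most probable answers; ties are broken alphabetically."""
    ranked = sorted(dist.items(), key=lambda item: (-item[1], item[0]))
    return [answer for answer, _ in ranked[:k]]


def top_k_overlap(predicted: Sequence[str], dist: Dict[str, float], k: int = 3) -> float:
    """Share of the true top-k answers that the model named in its self-prediction."""
    return len(set(top_k(dist, k)) & set(predicted[:k])) / k


def mean_abs_calibration_error(stated: Dict[str, float], dist: Dict[str, float]) -> float:
    """Mean absolute gap between stated and empirical probabilities."""
    return sum(abs(p - dist.get(answer, 0.0)) for answer, p in stated.items()) / len(stated)


# Made-up placeholder data standing in for real model outputs.
object_level_samples = ["Chad"] * 5 + ["Cuba"] * 3 + ["Togo"] * 2
self_predicted_top3 = ["Chad", "Cuba", "Mali"]
self_stated_probs = {"Chad": 0.6, "Cuba": 0.3, "Mali": 0.1}

dist = empirical_distribution(object_level_samples)
overlap = top_k_overlap(self_predicted_top3, dist)
calibration_error = mean_abs_calibration_error(self_stated_probs, dist)
print(f"top-3 overlap: {overlap:.2f}, calibration error: {calibration_error:.3f}")
```

If logits are available, the sampled estimate could be replaced with the exact next-token probabilities; sampling is used here only to keep the sketch self-contained.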
What do you think?
Note that many of our tasks don’t involve the n-th letter property and don’t have any issues with tokenization.
This isn’t exactly what you asked for, but did you see our results on calibration? We finetune a model to self-predict just the most probable response. But when we look at the model’s distribution of self-predictions, we find it corresponds pretty well to the distribution over properties of behaviors (despite the model never having been trained on the distribution). Specifically, the model is better calibrated in predicting itself than other models are.
I think having the model output the top three choices would be cool. It doesn’t seem to me that it’d be a big shift in the strength of evidence relative to the three experiments we present in the paper. But maybe there’s something I’m not getting?
Seeing the distribution calibration you point out does update my opinion a bit.
I feel like there’s still a significant distinction though between adding one calculation step to the question versus asking it to model multiple responses. It would have to model its own distribution in a single pass rather than having the distributions measured over multiple passes align (which I’d expect to happen if the fine-tuning teaches it the hypothetical is just like adding a calculation to the end).
As an analogy, suppose I have a pseudorandom black box function that returns an integer. In order to approximate the distribution of its outputs mod 10, I don’t have to know anything about the function; I can just sample the function and apply mod 10 post hoc. If I want to say something about this distribution without multiple samples, then I actually have to know something about the function.
There is related work you may find interesting. We discuss it briefly in section 5.1 on “Know What They Know”. They get models to predict whether they answer a factual question correctly. E.g. Confidence: 54%. In this case, the distribution is only binary (it is either correct or wrong), instead of our paper’s case where it is (sometimes) categorical. But I think training models to verbalize a categorical distribution should work, and there is probably some related work out there.
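The analogy can be sketched in a few lines of Python. The `black_box` below is an arbitrary stand-in chosen for this illustration (a square plus 7); the sampling code never looks inside it.

```python
import random
from collections import Counter
from typing import Dict


def black_box(rng: random.Random) -> int:
    """Stand-in for an opaque pseudorandom integer function (illustrative only)."""
    return rng.randrange(1_000_000) ** 2 + 7


def sampled_mod10_distribution(n_samples: int, seed: int = 0) -> Dict[int, float]:
    """Approximate the distribution of outputs mod 10 by sampling the function
    and applying mod 10 post hoc, without any knowledge of its internals."""
    rng = random.Random(seed)
    counts = Counter(black_box(rng) % 10 for _ in range(n_samples))
    return {digit: counts.get(digit, 0) / n_samples for digit in range(10)}


mod10_distribution = sampled_mod10_distribution(10_000)
for digit, probability in mod10_distribution.items():
    print(digit, f"{probability:.3f}")
```

Making a claim about this distribution without sampling requires knowledge of the function itself. For this particular stand-in, squares can only end in 0, 1, 4, 5, 6, or 9, so the outputs can only end in 1, 2, 3, 6, 7, or 8, and the digits 0, 4, 5, and 9 never occur.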
We didn’t find much related work on whether a model M1 has a very clear advantage in predicting its own distribution versus another model M2 predicting M1. This paper has some mixed but encouraging results.
That makes sense. It’s a good suggestion and would be an interesting experiment to run.