I want to make the case that even this minimal strategy would be something that we might want to call “introspective,” or that it can lead to the model learning true facts about itself.
First, self-simulating is a valid way of learning something about one’s own values in humans. Consider the thought experiment of the trolley problem. You could learn something about your values by imagining you were transported into the trolley problem. Do you pull the lever? Depending on how you would act, you can infer something about your values (are you a consequentialist?) that you might not have known before.
In the same way, if a model can predict how it would act in a hypothetical situation and can reason about that prediction, then for some forms of reasoning the model would learn a fact about itself as a result. Most of the response properties we test do not necessarily tell us something interesting about the model itself (“What would the second letter of your response have been?”), but the results of others tell you something about the model more straightforwardly (“Would you have chosen the more wealth-seeking answer?”). Insofar as the behavior in question sufficiently tracks something specific to the model (e.g., “What would you have said is the capital of France?” does not, but “What would you have said if we asked you if we should implement subscription fees?” arguably does), reasoning about that behavior would tell you something about the model.
So we have cases where (1) the model’s statement about properties of its hypothetical behavior tracks the actual behavior (which, as you point out, could just be a form of consistency) and (2) these statements are informative about the model itself (in the example above, whether it has a wealth-seeking policy or not). If we accept both of these claims, then it seems to me that even the strategy you outline above could lead the model to something that we might want to call introspection. The more complicated the behavior and the more complex the reasoning about it, the more the model might be able to derive about itself as a result of self-consistency of behavior plus reasoning on top of it.