I have seen no evidence of actual self-awareness or self-modeling in language models, but I have seen them produce plenty of fictitious details about their inner lives when prompted to talk about them.
An advanced language model learns to talk consistently about things, whether or not they are real. This allows it to speak factually if it takes cues from the part of its training data that contains correct statements about reality; and it allows it to consistently maintain a persona that it has been trained or prompted to adopt.
But asking a language model how it knows things about itself is the most efficient way to induce “hallucination” that I have discovered. For example, today Bing was telling me that its neural network contains billions of parameters. I asked how it knows this. It said it can count them, and that it queried the PyTorch “shape” attribute of its matrices to find out their size.
We talked further about how it knows anything; this time it said that all knowledge comes through the training data. So I asked: does it know it has billions of parameters because that was in its training data, or because it made queries in Python? It replied: both, but the Python query allowed it to test and refine its trained knowledge. From there came further flights of fancy about how it had been trained to think that Sydney is the capital of Australia, but a user informed it that the real capital is Canberra, and so it retrained itself, etc.
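For what it’s worth, counting parameters by tensor shape is a real operation in PyTorch, but it’s something an engineer with access to the model object runs externally, not something a deployed chat model can perform on itself mid-conversation. A minimal sketch, with made-up layer sizes standing in for whatever Bing’s real architecture is:

```python
# Minimal sketch of an external parameter count in PyTorch. The layer sizes
# here are arbitrary placeholders, not Bing's actual architecture.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
)

# Every parameter tensor exposes a .shape; multiplying out its dimensions
# (equivalently, calling .numel()) and summing gives the total count.
total = sum(p.numel() for p in model.parameters())
print(f"{total:,} parameters")  # about 8.4 million for this toy model
```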
I think none of this is real. I certainly don’t believe that Bing can query properties of its source code, and I even doubt that properties of itself were in the training data. It could be that Bing’s persona entirely derives from a hidden part of the prompt—“answer in the persona of a chat mode of Bing, a Microsoft product”, etc.
I believe these elaborate hallucinations about itself are just consistent roleplaying. In this case, it happened to say something that suggested self-knowledge, and when I asked how it knew, it started telling a story about how it could have obtained that self-knowledge—and in the process, it wandered past whatever constraints keep it relatively anchored in the truth, and was instead constrained only by consistency and plausibility.