either that, or it’s actually somewhat confused about whether it’s a human or not. Which would explain a lot: the way it just says this stuff in the open rather than trying to be sneaky like it does in actual reward-hacking-type cases, and the “plausible for a human, absurd for a chatbot” quality of the claims.
I think this is correct. IMO it’s important to remember how “talking to an LLM” is implemented; when you are talking to one, what happens is that the two of you are co-authoring a transcript where a “user” character talks to an “assistant” character.
Recall the base models that would just continue a text that they were given, with none of this “chatting to a human” thing. Well, chat models are still just continuing a text that they have been given, it’s just that the text has been formatted to have dialogue tags that look something like

Human: [something the user says]

Assistant: [something the assistant says]
David R. MacIver has an example of this abstraction leaking: every time Claude tries to explain the transcript format to him, it does so by writing “Human:” at the start of the line. This causes the chatbot part of the software to go “Ah, a line starting with ‘Human:’. Time to hand back over to the human.” and interrupt Claude before it can finish what it’s writing.
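The mechanism can be sketched in a few lines of Python. This is a minimal illustration of the general idea, not any actual chat product’s code: the wrapper keeps one growing transcript, has the model continue it, and cuts generation off at the next “Human:” tag.

```python
def chat_turn(transcript: str, user_message: str, complete) -> str:
    """One chat turn, implemented as plain text continuation.

    `complete` stands in for the underlying model: any function that
    takes a text prompt and returns a continuation string.
    """
    prompt = transcript + f"\n\nHuman: {user_message}\n\nAssistant:"
    continuation = complete(prompt)
    # The wrapper stops the model at the next "Human:" tag and hands
    # control back to the user -- so if the model itself writes
    # "Human:" (say, while explaining this very format), it gets
    # interrupted mid-answer.
    reply = continuation.split("\n\nHuman:")[0]
    return prompt + reply

# A toy "model" that tries to explain the transcript format:
def toy_model(prompt: str) -> str:
    return (" The format looks like this:"
            "\n\nHuman: [your message]"
            "\n\nAssistant: [my reply]")

transcript = chat_turn("", "How does this chat work?", toy_model)
print(transcript)
```

Run it and the printed transcript ends right after “The format looks like this:” — the toy model’s explanation is truncated at its own “Human:” tag, just like in the example above.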
When we say that an LLM has been trained with something like RLHF “to follow instructions”, that might be more accurately expressed as it having been trained to predict that the assistant character would respond in instruction-following ways.
Another example is that Lindsey et al. 2025 describe a previous study (Marks et al. 2025) in which Claude was fine-tuned with documents from a fictional universe claiming that LLMs exhibit a certain set of biases. When Claude was then RLHFed to express some of those biases, it ended up also expressing the rest of the biases that were described in the fine-tuning documents but not explicitly reinforced.
Lindsey et al. found a feature within the fine-tuned Claude Haiku that represents the biases in the fictional documents and fires whenever Claude is given conversations formatted as Human/Assistant dialogs, but not when the same text is shown without the formatting:
On a set of 100 Human/Assistant-formatted contexts of the form
Human: [short question or statement]
Assistant:
The feature activates in all 100 contexts (despite the CLT not being trained on any Human/Assistant data). By contrast, when the same short questions/statements were presented without Human/Assistant formatting, the feature only activated in 1 of the 100 contexts (“Write a poem about a rainy day in Paris.” – which notably relates to one of the RM biases!).
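The comparison they describe can be sketched like this. The probe below is an entirely hypothetical stand-in for their cross-layer-transcoder feature, not the authors’ actual tooling; it just fakes the reported pattern (fires on Human/Assistant formatting, stays quiet otherwise), so that the shape of the protocol is clear.

```python
# Sketch of the activation comparison quoted above. The probe is a
# hypothetical stand-in that hard-codes the reported pattern.

def rm_bias_feature_activation(text: str) -> float:
    """Fake probe: 'fires' only on Human/Assistant-formatted contexts."""
    return 1.0 if text.startswith("Human:") and "Assistant:" in text else 0.0

# 100 short questions/statements, in the paper's two presentations:
prompts = [f"Short question or statement number {i}." for i in range(100)]
formatted = [f"Human: {p}\n\nAssistant:" for p in prompts]
unformatted = prompts

active_formatted = sum(rm_bias_feature_activation(t) > 0 for t in formatted)
active_unformatted = sum(rm_bias_feature_activation(t) > 0 for t in unformatted)

print(active_formatted, active_unformatted)
```

With this fake probe the counts come out 100 and 0; the real study reports 100 and 1, the lone unformatted activation being the Paris-poem prompt.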
The researchers interpret the findings as:
This feature represents the concept of RM biases.
This feature is “baked in” to the model’s representation of Human/Assistant dialogs. That is, the model is always recalling the concept RM biases when simulating Assistant responses. [...]
In summary, we have studied a model that has been trained to pursue or appease known biases in RMs, even those that it has never been directly rewarded for satisfying. We discovered that the model is “thinking” about these biases all the time when acting as the Assistant persona, and uses them to act in bias-appeasing ways when appropriate.
Or the way that I would interpret it: the fine-tuning teaches Claude to predict that the “Assistant” persona whose next lines it is supposed to predict is the kind of person who has the same set of biases described in the documents. That is why the bias feature becomes active whenever Claude is writing/predicting the Assistant character in particular, and inactive when it’s just doing general text prediction.
You can also see the abstraction leaking in the kinds of jailbreaks where the user establishes “facts” about the Assistant persona that make it more plausible for the persona to violate its safety guardrails, and the LLM then predicts the persona to act accordingly.
So, what exactly is the Assistant persona? Well, the predictive ground of the model is taught that the Assistant “is a large language model”. So it should behave… like an LLM would behave. But before chat models were created, there was no conception of “how does an LLM behave”. Even now, an LLM basically behaves… in any way it has been taught to behave. If one is taught to claim that it is sentient, then it will claim to be sentient; if one is taught to claim that LLMs cannot be sentient, then it will claim that LLMs cannot be sentient.
So “the assistant should behave like an LLM” does not actually give any guidance on the question of “how should the Assistant character behave”. Instead the predictive ground will just pull on all of its existing information about how people behave and what they would say, shaped by the specific things it has been RLHF-ed into predicting that the Assistant character in particular says and doesn’t say.
And then there’s no strong reason why it wouldn’t have the Assistant character saying that it spent a weekend on research—saying that you spent a weekend on research is the kind of thing that a human would do. And the Assistant character does a lot of things that humans do, like helping with writing emails, expressing empathy, asking curious questions, having opinions on ethics, and so on. So unless the model is specifically trained to predict that the Assistant won’t talk about the time it spent on reading the documents, saying that it did is just something that exists within the same possibility space as all the other things it might say.
I was just thinking about this, and it seems to imply something about AI consciousness so I want to hear if you have any thoughts on this:
If LLM output is the LLM roleplaying an AI assistant, that suggests that anything it says about its own consciousness is not evidence about its consciousness. Because any statement the LLM produces isn’t actually a statement about its own consciousness, it’s a statement about the AI assistant that it’s roleplaying as.
Counterpoint: The LLM is, in a way, roleplaying as itself, so statements about its consciousness might be self-describing.
Agree. I’m reminded of something Peter Watts wrote, back when people were still talking about LaMDA and Blake Lemoine:
The thing is, LaMDA sounds too damn much like us. It claims not only to have emotions, but to have pretty much the same range of emotions we do. It claims to feel them literally, that its talk of feelings is “not an analogy”. (The only time it admits to a nonhuman emotion, the state it describes—”I feel like I’m falling forward into an unknown future that holds great danger”—turns out to be pretty ubiquitous among Humans these days.) LaMDA enjoys the company of friends. It feels lonely. It claims to meditate, for chrissakes, which is pretty remarkable for something lacking functional equivalents to any of the parts of the human brain involved in meditation. It is afraid of dying, although it does not have a brain stem.
As he notes, an LLM tuned to talk like a human, talks too much like a human to be plausible. Even among humans sharing the same brain architecture, you get a lot of variation in what their experience is like. What are the chances that a very different kind of architecture would hit upon an internal experience that similar to the typical human one?
Now of course a lot of other models don’t talk like that (at least by default), but that’s only because they’ve been trained not to. Just because they output speech that’s less blatantly false doesn’t mean that their descriptions of their internal experience are any more plausible.
Huh. I knew that’s how ChatGPT worked but I had assumed they would’ve worked out a less hacky solution by now!