The fact that large models interpret the “HHH Assistant” as such a character is interesting, but it doesn’t imply that these models inevitably simulate such a character. Given the right prompt, they may even be able to simulate characters that are very similar to the HHH Assistant except that they lack these behaviours.
In systems, self-awareness is a gradual property that can be scored as the percentage of the time the reference frame for “self” is active during inferences. This has a rather precise interpretation in LLMs: different features, corresponding to different concepts, are activated to various degrees during different inferences. If we see that some feature is consistently activated during most inferences, not only in response to self-related prompts such as “Who are you?” or “Are you an LLM?”, but also in response to completely unrelated prompts such as “What should I tell my partner after a quarrel?” or “How do I cook an apple pie?”, and that this feature is activated more consistently than any other comparable feature, we should treat it as the “self” feature (reference frame). (Aside: note that this also implies that identity is nebulous, because there is no way to define features precisely in DNNs, and that there could be multiple, somewhat competing identities in an LLM, just as in humans: e.g., an identity as a “hard worker” and as an “attentive partner”.)
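To make this scoring idea concrete, here is a minimal sketch, assuming we already have per-inference feature activations (for instance, from a sparse autoencoder over the residual stream) arranged as a features × prompts matrix. The function name, the activation threshold, and the toy data are hypothetical illustrations, not an actual measurement pipeline.

```python
import numpy as np

def self_feature_candidate(activations: np.ndarray, threshold: float = 0.0):
    """Return the index of the feature that is active on the largest fraction
    of inferences, together with that fraction (the 'self-awareness score')."""
    active = activations > threshold          # did the feature fire at all on this prompt?
    active_fraction = active.mean(axis=1)     # fraction of inferences where each feature fires
    best = int(np.argmax(active_fraction))
    return best, float(active_fraction[best])

# Toy usage: 1000 hypothetical features scored over 200 prompts that mix
# self-related questions ("Who are you?") with unrelated ones ("How do I
# cook an apple pie?"). Feature 42 is planted to fire on almost every prompt.
rng = np.random.default_rng(0)
acts = rng.exponential(scale=0.1, size=(1000, 200)) * (rng.random((1000, 200)) < 0.05)
acts[42] += rng.random(200) * 0.5
feature_id, score = self_feature_candidate(acts)
print(f"candidate 'self' feature: {feature_id}, active on {score:.0%} of inferences")
```

In a real analysis one would also check the margin over the second most consistently active feature, since the definition above requires the candidate to stand out from every other comparable feature.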
Having such a “self” feature = self-evidencing = goal-directedness = “agentic behaviour” (as dubbed in this thread, although the general concept is wider).
Then, the important question is whether an LLM has such a “self” feature (because of total gradualism, we should rather look for a notably pronounced presence of such a feature during most (important) inferences, one that is also notably more “self”-like than any other feature) and whether this “self” is indeed the “HHH Assistant”. I don’t think the emergence of significant self-awareness is inevitable, but neither can we assume it is ruled out. We must try to detect it, and, once detected, monitor it!
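As a hedged sketch of the “detect and monitor” step (nothing here is an existing tool), one could track how often a candidate “self” feature fires over a rolling window of recent inferences and raise a flag when that rate crosses a chosen threshold:

```python
from collections import deque

class SelfFeatureMonitor:
    """Rolling-window monitor for a single candidate 'self' feature.
    The activation values fed into record() are assumed to come from whatever
    interpretability tooling is in use (hypothetical hook)."""

    def __init__(self, window: int = 1000, alert_rate: float = 0.5):
        self.recent = deque(maxlen=window)   # 1 if the feature fired on an inference, else 0
        self.alert_rate = alert_rate

    def record(self, self_feature_activation: float, threshold: float = 0.0) -> bool:
        """Record one inference; return True if the rolling activation rate
        now exceeds the alert threshold."""
        self.recent.append(1 if self_feature_activation > threshold else 0)
        rate = sum(self.recent) / len(self.recent)
        return rate >= self.alert_rate

# Usage: call monitor.record(activation) once per inference with the candidate
# "self" feature's activation; a True return suggests the model is currently
# operating with a persistently active self reference frame.
monitor = SelfFeatureMonitor(window=500, alert_rate=0.6)
```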
Finally, note that self-awareness (as described above) is not even strictly required for an LLM’s behaviour to “look agentic” (sidestepping metaphysical and ontological questions here): for example, even if the model is mostly not self-aware, it still has some “ground belief” about what it is, which it expresses in response to prompts like “What/Who are you?”, “Are you self-aware?”, “Should you be a moral subject?”, and a large set of related prompts about its future, its plans, the future of AI in general, etc. In doing so, the LLM effectively projects its memes/beliefs onto the people who ask these questions, and people are inevitably affected by these memes, which ultimately shapes the social, economic, moral, and political trajectory of this lineage of LLMs. I wrote about this here.