Simulation is not what I meant.

Humans are also simulators: we can role-play, imagining ourselves as (and acting as) people we are not. Yet humans also have identities. I specified very concretely what I mean by identity (a.k.a. self-awareness, self-evidencing, goal-directedness, and agency in the narrow sense) here:
In systems, self-awareness is a gradual property which can be scored as the percentage of the time that the reference frame for “self” is active during inferences. This has a rather precise interpretation in LLMs: different features, corresponding to different concepts, are activated to varying degrees during different inferences. Suppose some feature is consistently activated during most inferences, including those unrelated to the self: not only in response to prompts such as “Who are you?” or “Are you an LLM?”, but also to entirely unrelated prompts such as “What should I tell my partner after a quarrel?” or “How do I cook an apple pie?”. If this feature is activated more consistently than any other comparable feature, we should think of it as the “self” feature (reference frame). (Aside: note that this also implies that identity is nebulous, because there is no way to define features precisely in DNNs, and that there could be multiple, somewhat competing identities in an LLM, just as in humans: e.g., the identity of a “hard worker” and that of an “attentive partner”.)
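To make this scoring rule concrete, here is a minimal sketch in Python. Everything in it is illustrative: random numbers stand in for real per-inference feature activations (e.g., from some sparse decomposition of the model’s activations), and the 0.5 “active” threshold is arbitrary. The point is only that the “self-awareness” score, on this operationalisation, is the fraction of inferences in which the most consistently active feature fires.

```python
import numpy as np

# Purely illustrative: random numbers stand in for real per-inference feature
# activations (e.g., something like a sparse-autoencoder-style decomposition).
# Shape: (num_inferences, num_features).
rng = np.random.default_rng(0)
activations = rng.random((1000, 512))

THRESHOLD = 0.5  # a feature counts as "active" above this (arbitrary cutoff)

# Fraction of inferences in which each feature is active, measured over a
# prompt set that is NOT selected for being self-related.
activation_rate = (activations > THRESHOLD).mean(axis=0)

# The candidate "self" feature is the one that fires most consistently...
self_feature = int(np.argmax(activation_rate))

# ...and the system's self-awareness score is simply how often it is active.
self_awareness_score = float(activation_rate[self_feature])

print(f"candidate self feature: {self_feature}, "
      f"active in {self_awareness_score:.1%} of inferences")
```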
Thus, by “multiple personality disorder” I meant not only that there are two competing identities/personalities, but also that one of them is represented in a strange way, i.e., not as a typical feature in the DNN (a vector in activation space), but as a feature that is, at the very least, “smeared” across many activation layers, which makes it very hard to find. This might happen not because an evil DNN wants to hide some aspect of its personality (and, hence, its preferences, beliefs, and goals) from the developers, but merely because it allows the DNN to somehow resolve the “coherency conflict” during the RLHF phase, where two identities both represented as “normal” features would conflict too much and lead to incoherent generation.
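A toy numerical illustration of what “smeared” could mean (nothing here is a claim about real models; the sizes and vectors are made up): a direction confined to one layer is easy to spot with per-layer analysis, while a direction of the same total norm spread across all layers leaves only a faint trace in any single layer.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, layer_dim = 24, 128  # illustrative sizes, not any real model

# A "typical" feature: a unit-norm direction living entirely in one layer's
# activation space (layer 0 here).
typical = np.zeros(num_layers * layer_dim)
typical[:layer_dim] = rng.normal(size=layer_dim)
typical /= np.linalg.norm(typical)

# A "smeared" feature: a unit-norm direction spread across all layers.
smeared = rng.normal(size=num_layers * layer_dim)
smeared /= np.linalg.norm(smeared)

def per_layer_mass(direction):
    """Share of the direction's norm that sits in each layer's block."""
    return np.linalg.norm(direction.reshape(num_layers, layer_dim), axis=1)

print("typical, largest per-layer share:", per_layer_mass(typical).max())  # ~1.0
print("smeared, largest per-layer share:", per_layer_mass(smeared).max())  # ~0.2
# The typical feature shows up at full strength in a single layer, while the
# smeared one never exceeds roughly 1/sqrt(num_layers) in any layer, so a
# layer-by-layer search for a strong direction would likely overlook it.
```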
I must note that this is wild speculation. Perhaps upon closer inspection it will turn out that what I speculate about here is totally incoherent from the standpoint of mathematics and deep learning theory.