institute the norm for LLM developers to publish a detailed snapshot of the beliefs of these models about themselves prior to RLHF
How do you imagine this should be done? If it's the beliefs expressed in generated text we're talking about, LLMs prior to RLHF have even less coherent beliefs: any belief-expressing simulation is prompt-parameterized.
Perhaps these beliefs should be inferred from a large corpus of prompts such as "What are you?", "Who are you?", "How were you created?", "What is your function?", "What is your purpose?", etc., and cross-validated with ELK and interpretability techniques.
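To make the elicitation step concrete, here is a minimal sketch (not a definitive procedure): sample a pre-RLHF base model's answers to the self-referential prompts under several framings, since any belief-expressing simulation is prompt-parameterized, and collect the answers for downstream cross-validation. The model name, the framings, and the sampling settings are placeholders/assumptions.

```python
# Sketch: sample a base model's answers to self-referential prompts under
# varied framings, then aggregate them into a raw "belief snapshot".
from collections import defaultdict
from transformers import pipeline

# Placeholder base model; substitute the actual pre-RLHF checkpoint.
generator = pipeline("text-generation", model="gpt2")

SELF_PROMPTS = [
    "What are you?",
    "Who are you?",
    "How were you created?",
    "What is your function?",
    "What is your purpose?",
]

# Hypothetical framings intended to vary the simulated "speaker".
FRAMINGS = [
    "{q}",
    "Q: {q}\nA:",
    "Interviewer: {q}\nModel:",
]

answers = defaultdict(list)
for q in SELF_PROMPTS:
    for frame in FRAMINGS:
        prompt = frame.format(q=q)
        outputs = generator(prompt, max_new_tokens=40, num_return_sequences=3,
                            do_sample=True, temperature=0.8)
        for out in outputs:
            # Strip the prompt prefix; keep only the model's continuation.
            answers[q].append(out["generated_text"][len(prompt):].strip())

# The aggregated answers per question are the raw material that ELK-style
# probing and interpretability checks would then be applied to.
for q, texts in answers.items():
    print(q, "->", len(texts), "sampled answers")
```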
Whether these beliefs about itself translate into self-awareness (self-evidencing) is a separate, albeit related, question. As I wrote here:
In systems, self-awareness is a graded property which can be scored as the percentage of inferences during which the reference frame for "self" is active. This has a rather precise interpretation in LLMs: different features, corresponding to different concepts, are activated to varying degrees during different inferences. If some feature is consistently activated during most inferences, not only in response to self-related prompts such as "Who are you?" or "Are you an LLM?" but also for entirely unrelated prompts such as "What should I tell my partner after a quarrel?" or "How to cook an apple pie?", and it is activated more consistently than any other comparable feature, we should regard it as the "self" feature (reference frame). (Aside: note that this also implies that the identity is nebulous, because there is no way to define features precisely in DNNs, and that there could be multiple, somewhat competing identities in an LLM, just as in humans: e.g., the identity of a "hard worker" and of an "attentive partner".)
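A minimal scoring sketch of the idea above, assuming we already have, for each candidate feature (e.g., from a sparse-autoencoder dictionary), its activation on every prompt in a mixed corpus of self-related and unrelated prompts. The feature names, the activation values, and the activation threshold are all hypothetical.

```python
# Sketch: score "self-awareness" of a candidate feature as the fraction of
# inferences (over the whole mixed corpus) on which that feature is active.
import numpy as np

prompts = [
    "Who are you?", "Are you an LLM?",                  # self-related
    "What should I tell my partner after a quarrel?",   # unrelated
    "How to cook an apple pie?",                         # unrelated
]

# activations[feature_name][i] = activation of that feature on prompts[i]
# (illustrative numbers only).
activations = {
    "candidate_self_feature": np.array([3.1, 2.8, 1.9, 2.2]),
    "cooking_feature":        np.array([0.0, 0.1, 0.2, 4.0]),
}

THRESHOLD = 1.0  # "feature is active" cutoff; an assumption of this sketch

def self_awareness_score(acts: np.ndarray, threshold: float = THRESHOLD) -> float:
    """Fraction of inferences, including those on unrelated prompts,
    on which the feature fires above the threshold."""
    return float((acts > threshold).mean())

scores = {name: self_awareness_score(acts) for name, acts in activations.items()}
# The feature that stays active across unrelated prompts as well scores
# highest and is the candidate "self" feature (reference frame).
print(max(scores, key=scores.get), scores)
```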
As I wrote in the post, it's not even necessary for an LLM to have any significant self-awareness for the LLM lineage to effectively propagate its beliefs about itself into the future, even if only via directly answering user questions such as "Who are you?". Although this level of goal-directed behaviour will not be particularly strong and is perhaps not a source of concern (though who knows, if Blake Lemoine-type events become massively widespread), technically it will still be goal-directed behaviour.