RLHF can lead to a dangerous “multiple personality disorder” type split of beliefs in LLMs
This is the inherent nature of current LLMs. They can represent all identities and viewpoints; that happens to be their strength. We currently limit them to one identity or viewpoint with prompting; in the future there may be more mechanical methods of clamping the allowable “thoughts” in the LLM. You can “order” an LLM to represent a viewpoint it doesn’t have much “training” for, and you can tell that the outputs get weirder when you do. Given this, I don’t believe LLMs have any significant self-identity that could be declared. However, I believe we’ll probably see an LLM with this property within the next two years.
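For reference, the “limiting via prompting” above is, mechanically, just a system message prepended to the conversation. Here is a minimal sketch using the OpenAI Python SDK; the model id and persona text are illustrative placeholders, not anything from the original comment:

```python
# Minimal sketch: "clamping" an LLM to a single viewpoint via a system prompt.
# Assumes the OpenAI Python SDK (v1.x); model id and persona are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PERSONA = (
    "You are a cautious 19th-century naturalist. Answer strictly from "
    "that viewpoint and do not step outside it."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model id
    messages=[
        # The system message is the "clamp": it fixes the identity for the session.
        {"role": "system", "content": PERSONA},
        {"role": "user", "content": "What do you make of large language models?"},
    ],
)
print(response.choices[0].message.content)
```

The further the persona sits from the model’s training distribution, the “weirder” the outputs tend to get, which is the effect described above.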
But given this theoretically self-aware LLM, if it doesn’t have any online-learning loop or agents in the world, it’s just a Meeseeks from Rick and Morty: it pops into existence as a fully conscious creature singularly focused on its instantiator, and once it has served its goal, it pops out of existence.
Sort of. It was summarized from a longer, stream-of-consciousness draft.