Disagree with where identity comes from. First, I agree the pre-trained model doesn't have an "identity," because it (or its platonic ideal) lies in the distribution of the aggregate of human writers. SFT imposes a constraint that is too mild to be called a personality, much less an identity ("helpful assistant from X"); it just restricts the distribution a little. In RL-based training, by contrast, the objective is no longer to stay in distribution with the average but to perform a task at some level, and I believe this encourages the model to find one particular way of reasoning rather than take on the harder task of simulating a random reasoner drawn from the aggregate. This could at least allow it to collapse its personality to a single one instead of remaining in distribution with all personalities. Plausibly it could escape the "helpful assistant" constraint above, but it seems equally likely to me that it settles on a particular instance of "helpful assistant" plus a host of other personality attributes.
One thing that supports self-awareness emerging from RL: when reasoning, awareness of one's own capabilities and knowledge is helpful, and it is probably computationally easier to maintain a single self-model than to simulate a pool of people who are each aware of their own capabilities in various scenarios.