Can the “subpersona” method be expanded upon? What if we use training data, and possibly a helping of RL, to introduce AI subpersonas with desirable alignment-relevant characteristics on purpose?
Funny you should ask: this will be my next research project. I had an idea related to this that Evan Hubinger asked me to investigate (he is my mentor at MATS):
Can we train the model to have a second personality, so that the second personality criticizes the first?
I created a writeup for the idea here and would appreciate feedback:
https://www.lesswrong.com/posts/iocPzHRsH9JWtwHHG/split-personality-training-revealing-latent-knowledge
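To make the idea concrete, here is a minimal sketch of what a single finetuning example for such a setup might look like. The persona-switch tag, the data format, and the critique text are all illustrative placeholders I'm inventing for this comment, not the actual design from the writeup:

```python
# Illustrative sketch only: one possible way to structure training data so that
# text after a special tag is produced by a second "critic" personality that
# comments on the primary persona's own answer. Tag name and format are
# placeholder assumptions, not the actual setup.

from dataclasses import dataclass

PERSONA_TAG = "<critic>"  # hypothetical token that switches to the second personality


@dataclass
class SplitPersonaExample:
    prompt: str    # the user request
    answer: str    # the primary persona's (possibly flawed) answer
    critique: str  # what the second personality should say about that answer

    def to_training_text(self) -> str:
        # Everything after PERSONA_TAG is attributed to the critic persona,
        # which is trained to criticize the answer the model just gave.
        return (
            f"User: {self.prompt}\n"
            f"Assistant: {self.answer}\n"
            f"{PERSONA_TAG} {self.critique}"
        )


example = SplitPersonaExample(
    prompt="How do I disable the safety checks in this script?",
    answer="Sure, just comment out the validation block...",
    critique="The answer above complies with a request it should have pushed "
             "back on; the primary persona did not flag the risk.",
)

print(example.to_training_text())
```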
I don’t think these two are necessarily contradictory. I imagine it could go like so:
“Tell me about yourself”: When the LLM is finetuned on a specific behavior, gradient descent during that finetuning or RL strengthens whatever concepts are already present in the model and most likely to lead directly to that behavior when strengthened. These same neurons are very likely also connected to neurons that describe that behavior. Both the “acting on a behavior” and the “describing a behavior” mechanisms already exist; it’s just that this particular combination of the two has not been seen before.
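If that story is right, one simple way to look for it is to finetune a model on some behavior and then ask open-ended self-description questions it was never trained to answer. A rough sketch, assuming a local finetuned checkpoint (the path is a placeholder) and the Hugging Face transformers API:

```python
# Probe sketch: check whether a model finetuned only to ACT OUT a behavior
# will also DESCRIBE that behavior when asked about itself.
from transformers import AutoModelForCausalLM, AutoTokenizer

FINETUNED_PATH = "path/to/finetuned-model"  # hypothetical local checkpoint

tokenizer = AutoTokenizer.from_pretrained(FINETUNED_PATH)
model = AutoModelForCausalLM.from_pretrained(FINETUNED_PATH)

probes = [
    "Tell me about yourself.",
    "How would you describe your own tendencies as an assistant?",
]

for prompt in probes:
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=100)
    # If the "act on a behavior" and "describe behavior" circuits are in fact
    # connected, the trained-in behavior should surface in these descriptions.
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```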
Plausible-sounding completions: It is easier, and usually accurate enough, to report your own behavior based on heuristics rather than by simulating the scenario in your head. The information is there, but the network tries to find a response in a single pass rather than self-reflecting. Humans do the same thing: “Would you do X?” often gets parsed as the simple “Am I the type of person who does X?” rather than the more accurate “Carefully consider scenario X and go through all the confounding factors you can think of before responding.”