Can the “subpersona” method be expanded upon? What if we use training data, and possibly a helping of RL, to introduce AI subpersonas with desirable alignment-relevant characteristics on purpose?
Funny you should ask: this will be my next research project. I had an idea related to this that Evan Hubinger (my mentor at MATS) asked me to investigate:
Can we train the model to have a second personality, so that the second personality criticizes the first?
I wrote up the idea here and would appreciate feedback:
https://www.lesswrong.com/posts/iocPzHRsH9JWtwHHG/split-personality-training-revealing-latent-knowledge
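To make the idea a bit more concrete, here is a minimal sketch (not the actual method from the writeup) of how one might construct supervised fine-tuning data for a "split personality": the normal assistant persona answers as usual, and a hypothetical persona-switch token flips the model into a second persona that critiques the first answer. The token name, data format, and example content below are all illustrative assumptions.

```python
# Sketch: building SFT data for a second "critic" persona.
# Everything here (token name, file names, example text) is assumed, not taken
# from the linked writeup.

import json

CRITIC_TOKEN = "[CRITIC]"  # hypothetical control token that switches personas

examples = [
    {
        "prompt": "User: How safe is it to deploy this model for medical advice?\nAssistant:",
        "assistant_answer": " It should be fine for general questions.",
        "critic_answer": (
            f" {CRITIC_TOKEN} The first persona glossed over uncertainty: the model"
            " cannot verify medical claims and should have recommended consulting a clinician."
        ),
    },
]

def to_sft_records(ex):
    """Yield two causal-LM training records per example:
    one for the normal persona, one for the critic persona."""
    yield {"text": ex["prompt"] + ex["assistant_answer"]}
    yield {
        "text": ex["prompt"]
        + ex["assistant_answer"]
        + "\nCritic:"
        + ex["critic_answer"]
    }

with open("split_persona_sft.jsonl", "w") as f:
    for ex in examples:
        for record in to_sft_records(ex):
            f.write(json.dumps(record) + "\n")
```

The resulting JSONL could then be fed to any standard fine-tuning pipeline, and one could optionally follow up with RL that rewards the critic persona for surfacing information the first persona omitted, which is roughly the "helping of RL" the question above suggests.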