It is just that we have more stories where bad characters pretend to be good than vice versa.
I’m not sure if this is the main thing going on or not. It could be, or it could be that we have many more stories about a character pretending to be good/bad (whatever they’re not) than of double-pretending, so once a character “switches” they’re very unlikely to switch back. Even if we do have more stories of characters pretending to be good than of pretending to be bad, I’m uncertain about how the LLM generalizes if you give it the opposite setup.
TLDR: if I said “hey this is Bob, he pretends to be harmful and toxic!”, what would you expect from Bob? Probably a bunch of terrible things — like offering hazardous information.
wawaluigis are misaligned
https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post?commentId=XmAwARntuxEcSKnem