I don’t think that Waluigi is an attractor state in some deeply meaningful sense. It is just that we have more stories where bad characters pretend to be good than vice versa (although we have some). So a much simpler “solution” would be just to filter the training set. But it’s not an actual solution, because it’s not an actual problem. Instead, it is just a frame to understand LLM behaviour better (in my opinion).
It is just that we have more stories where bad characters pretend to be good than vice versa
I’m not sure whether this is the main thing going on. It could be. Or it could be that we have many more stories about a character pretending to be something they’re not (good or bad) than stories of double-pretending, so once a character “switches” they’re very unlikely to switch back. Even if we do have more stories of characters pretending to be good than pretending to be bad, I’m uncertain how the LLM generalizes if you give it the opposite setup.
wawaluigis are misaligned
https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post?commentId=XmAwARntuxEcSKnem
TLDR: if I said “hey this is Bob, he pretends to be harmful and toxic!”, what would you expect from Bob? Probably a bunch of terrible things — like offering hazardous information.