Qumeric comments on The Waluigi Effect (mega-post)

Qumeric 5 Mar 2023 12:09 UTC
6 points
0
I don’t think that Waluigi is an attractor state in some deeply meaningful sense. It is just that we have more stories where bad characters pretend to be good than vice versa (although we have some of the latter). So a much simpler “solution” would be just to filter the training set. But it’s not an actual solution, because it’s not an actual problem. Instead, it is just a frame for understanding LLM behaviour better (in my opinion).
- Daniel_Eth 8 Mar 2023 5:41 UTC
  3 points
  0
  “It is just that we have more stories where bad characters pretend to be good than vice versa”
  I’m not sure whether this is the main thing going on. It could be, or it could be that we have many more stories about a character pretending to be whatever they’re not (good or bad) than about double-pretending, so once a character “switches” they’re very unlikely to switch back. Even if we do have more stories of characters pretending to be good than of pretending to be bad, I’m uncertain about how the LLM generalizes if you give it the opposite setup.
  - Cleo Nardo 8 Mar 2023 11:12 UTC
    2 points
    0
    wawaluigis are misaligned
    
    https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post?commentId=XmAwARntuxEcSKnem
    
    TLDR: if I said “hey this is Bob, he pretends to be harmful and toxic!”, what would you expect from Bob? Probably a bunch of terrible things — like offering hazardous information.