Daniel_Eth comments on The Waluigi Effect (mega-post)

Daniel_Eth 5 Mar 2023 9:22 UTC
LW: 2 AF: 1
−2
AF
Proposed solution – fine-tune an LLM for the opposite of the traits that you want, then in the prompt elicit the Waluigi. For instance, if you wanted a politically correct LLM, you could fine-tune it on a bunch of anti-woke text, and then in the prompt use a jailbreak.

I have no idea if this would work, but seems worth trying, and if the waluigi are attractor states while the luigi are not, this could plausible get around that (also, experimenting around with this sort of inversion might help test whether the waluigi are indeed attractor states in general).
- Seb Farquhar 22 Mar 2023 15:37 UTC
  LW: 4 AF: 3
  0
  AF Parent
  I’m not sure how serious this suggestion is, but note that:
  1. It involves first training a model to be evil, running it, and hoping that you are good enough at jailbreaking to make it good rather than make it pretend to be good. And then to somehow have that be stable.
  2. The opposite of something really bad is not necessarily good. E.g., the opposite of a paperclip maximiser is… I guess a paperclip minimiser? That seems approximately as bad.
- Qumeric 5 Mar 2023 12:09 UTC
  4 points
  0
  Parent
  I don’t think that Waluigi is an attractor state in some deeply meaningful sense. It is just that we have more stories where bad characters pretend to be good than vice versa (although we have some). So a much simpler “solution” would be just to filter the training set. But it’s not an actual solution, because it’s not an actual problem. Instead, it is just a frame to understand LLM behaviour better (in my opinion).
  - Daniel_Eth 8 Mar 2023 5:41 UTC
    1 point
    0
    Parent
    It is just that we have more stories where bad characters pretend to be good than vice versa
    I’m not sure if this is the main thing going on or not. It could be, or it could be that we have many more stories about a character pretending to be good/bad (whatever they’re not) than of double-pretending, so once a character “switches” they’re very unlikely to switch back. Even if we do have more stories of characters pretending to be good than of pretending to be bad, I’m uncertain about how the LLM generalizes if you give it the opposite setup.
    - Cleo Nardo 8 Mar 2023 11:12 UTC
      2 points
      0
      Parent
      wawaluigis are misaligned
      
      https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post?commentId=XmAwARntuxEcSKnem
      
      TLDR: if I said “hey this is Bob, he pretends to be harmful and toxic!”, what would you expect from Bob? Probably a bunch of terrible things — like offering hazardous information.