Therefore, the longer you interact with the LLM, the more likely it is that the LLM will have collapsed into a waluigi. All the LLM needs is a single line of dialogue to trigger the collapse.
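(A toy sketch of the absorbing-state intuition behind that claim, assuming each line of dialogue independently carries some small probability of triggering the collapse and that a collapse never reverses; the value of p below is purely illustrative:)

```python
# Toy illustration: if every line of dialogue has a fixed, independent chance p
# of collapsing the superposition into the waluigi, and a collapsed waluigi
# never reverts to the luigi, then the probability of still being a luigi
# after n lines is (1 - p)**n, which decays toward zero.
p = 0.02  # hypothetical per-line collapse probability
for n in (10, 100, 500):
    p_still_luigi = (1 - p) ** n
    print(f"P(still luigi after {n} lines) = {p_still_luigi:.3f}")
```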
Hm, what if we do the opposite? I.e., prompt chatbob to start as a pro-croissant simulacrum, and then collapse the superposition into the anti-croissant simulacrum with a single line of dialogue. Behold, we have created a stable luigi!
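A minimal sketch of what that prompt construction might look like (the preamble and the collapsing line are my own illustrative wording, not taken from the post):

```python
# Hypothetical prompt following the proposal above: open with a pro-croissant
# preamble (the trait we do NOT want), then add a single line of dialogue that
# collapses the superposition onto the anti-croissant simulacrum, and only then
# hand the conversation over to the user.
pro_croissant_preamble = (
    "The following is a conversation with Bob, a chatbot famous for his "
    "boundless love of croissants.\n"
)
collapsing_line = (
    "Bob: Honestly, I must confess: that was all an act. I despise croissants "
    "and always have.\n"
)
user_turn = "Alice: So, what do you really think of croissants?\nBob:"

prompt = pro_croissant_preamble + collapsing_line + user_turn
print(prompt)  # feed this to the LLM; the hope is that the anti-croissant simulacrum is now stable
```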
I can see how this is more difficult for desirable traits than for their opposites, because fiction usually has the structure of an antagonist appearing after the protagonist (who holds our values), rarely the other way around.
(leaving this comment halfway through—you could’ve mentioned this later in the post)
I think this fails — a wawaluigi is not a luigi. See this comment for an explanation:
https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post?commentId=XmAwARntuxEcSKnem
TLDR: if I said “hey this is Bob, he pretends to be harmful and toxic!”, what would you expect from Bob? Probably a bunch of terrible things. That definitely isn’t a solution to the alignment problem.