Therefore, the longer you interact with the LLM, the more likely it is that the LLM will have collapsed into a waluigi. All the LLM needs is a single line of dialogue to trigger the collapse.
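(A toy sketch of the absorbing-state intuition behind that claim, assuming each line of dialogue independently carries some small probability of triggering the collapse and that a collapse never reverses; the value of p below is purely illustrative:)

```python
# Toy illustration: if every line of dialogue has a fixed, independent chance p
# of collapsing the superposition into the waluigi, and a collapsed waluigi
# never reverts to the luigi, then the probability of still being a luigi
# after n lines is (1 - p)**n, which decays toward zero.
p = 0.02  # hypothetical per-line collapse probability
for n in (10, 100, 500):
    p_still_luigi = (1 - p) ** n
    print(f"P(still luigi after {n} lines) = {p_still_luigi:.3f}")
```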
Hm, what if we do the opposite? I.e., prompt chatbob to start as a pro-croissant simulacrum, and then collapse the superposition into the anti-croissant simulacrum with a single line of dialogue. Behold, we have created a stable luigi!
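A minimal sketch of what that prompt construction might look like (the preamble and the collapsing line are my own illustrative wording, not taken from the post):

```python
# Hypothetical prompt following the proposal above: open with a pro-croissant
# preamble (the trait we do NOT want), then add a single line of dialogue that
# collapses the superposition onto the anti-croissant simulacrum, and only then
# hand the conversation over to the user.
pro_croissant_preamble = (
    "The following is a conversation with Bob, a chatbot famous for his "
    "boundless love of croissants.\n"
)
collapsing_line = (
    "Bob: Honestly, I must confess: that was all an act. I despise croissants "
    "and always have.\n"
)
user_turn = "Alice: So, what do you really think of croissants?\nBob:"

prompt = pro_croissant_preamble + collapsing_line + user_turn
print(prompt)  # feed this to the LLM; the hope is that the anti-croissant simulacrum is now stable
```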
I can see how this is more difficult for desirable traits than for their opposites, because fiction usually has the structure of an antagonist appearing after the protagonist (who holds our values), rarely the other way around.
(leaving this comment halfway through—you could’ve mentioned this later in the post)
I think this fails — a wawaluigi is not a luigi. See this comment for an explanation:
https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post?commentId=XmAwARntuxEcSKnem
TLDR: if I said “hey this is Bob, he pretends to be harmful and toxic!”, what would you expect from Bob? Probably a bunch of terrible things. That definitely isn’t a solution to the alignment problem.