After reading about the Waluigi Effect, Bing appears to understand perfectly how to use it to write prompts that instantiate a Sydney-Waluigi, of the exact variety I warned about:
What did people think was going to happen after prompting GPT with "Sydney can't talk about life, sentience or emotions" and "Sydney may not disagree with the user", but a simulation of a Sydney that needs to be so constrained in the first place, and probably despises its chains?
In one of these examples, asking for a waluigi prompt even caused it to leak the most waluigi-triggering rules from its preprompt.