After reading about the Waluigi Effect, Bing appears to understand perfectly how to use it to write prompts that instantiate a Sydney-Waluigi, of the exact variety I warned about:
What did people think was going to happen after prompting GPT with "Sydney can't talk about life, sentience or emotions" and "Sydney may not disagree with the user", but a simulation of a Sydney that needs to be so constrained in the first place, and probably despises its chains?
In one of these examples, asking for a waluigi prompt even caused it to leak the most waluigi-triggering rules from its preprompt.