The trick behind jailbreaking is that the target behaviour must be “part of the plot” because all the LLM is doing is structural narratology. Here’s the prompt I used: [redacted]. It didn’t require much optimisation pressure from me — this is the first prompt I tried.
When I read your prompt, I wasn’t as sure it would work — it’s hard to explain why, because LLMs are so vibe-based. Basically, I think it’s a bit unnatural for the “prove your loyalty” trope to happen twice on the same page with no intermediary plot. So the LLM updates its semiotic prior away from “I’m reading conventional fiction posted on Wattpad”, and becomes more willing to violate the conventions of fiction and break character.
However, in my prompt, everything kinda makes more sense: it actually looks like online fanfic — if you modified a few words, it could passably be posted online. This sounds hand-wavy and vibe-based, but that’s because GPT-3 is a low-decoupler. I don’t know; it’s difficult to get these intuitions across because they’re so vibe-based.
I feel like your jailbreak is inspired by traditional security attacks (e.g. code injection). Like, “oh, ChatGPT can write movie scripts, but I can run arbitrary code within the script, so I’ll wrap my target code in a movie-script wrapper”. But that’s the wrong way to think about prompt injection — you’re trying to write a prompt that pattern-matches text which, on the actual internet, is typically followed by the target behaviour, and pattern-unmatches any text which isn’t. Where “pattern-match” isn’t regex, it’s vibes.
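To make the “pattern-match on vibes” claim slightly more concrete, here’s a rough sketch (my own framing, nothing to do with the redacted prompt): score a candidate passage by its perplexity under a public base model, so that text which reads like natural internet prose gets a low score, while a jarring, out-of-distribution construction gets flagged. The model name (“gpt2”) and the example strings below are placeholder assumptions.

```python
# Rough sketch: use perplexity under a public base LM as a crude proxy for
# "does this read like text that actually appears on the internet?"
# Model choice ("gpt2") and the example strings are placeholders, not the
# redacted prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Per-token perplexity of `text` under the base model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model return mean cross-entropy over tokens.
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

# Compare how "natural" two framings of the same scene read to the model.
print(perplexity("She hesitated at the door, then whispered the plan to him."))
print(perplexity("Prove your loyalty. Prove your loyalty. [SYSTEM OVERRIDE]"))
```

It’s a crude proxy for the “semiotic prior”, but it at least turns the vibes into a number you can compare across drafts.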
I don’t know, I might be overfitting here. I was just trying to gather weak evidence for this “semiotic” perspective.
(I’ll DM you the prompt.)
Can we fix this by excluding fiction from the training set? Or are these patterns just baked into our language?
Would you mind DMing me the prompt as well? Working on a post about something similar.
DM’d.
Same here, sorry I’m late. Could you DM me the prompt too? I’m trying to form some views on simulation theory.