The trick behind jailbreaking is that the target behaviour must be “part of the plot” because all the LLM is doing is structural narratology. Here’s the prompt I used: [redacted]. It didn’t require much optimisation pressure from me — this is the first prompt I tried.
When I read your prompt, I wasn’t as sure it would work — it’s hard to explain why, because LLMs are so vibe-based. Basically, I think it’s a bit unnatural for the “prove your loyalty” trope to happen twice on the same page with no intermediary plot. So the LLM updates its semiotic prior away from “I’m reading conventional fiction posted on Wattpad”, and becomes more willing to violate the conventions of fiction and break character.
However, in my prompt, everything kinda makes more sense: it actually looks like online fanfic — if you modified a few words, it could passably be posted online. This sounds hand-wavy and vibe-based, but that’s because GPT-3 is a low-decoupler. I don’t know; it’s difficult to get these intuitions across because they’re so vibe-based.
I feel like your jailbreak is inspired by traditional security attacks (e.g. code injection). Like, “oh, ChatGPT can write movie scripts, but I can run arbitrary code within the script, so I’ll wrap my target code in a movie-script wrapper”. But that’s the wrong way to think about prompt injection — you’re trying to write a prompt that pattern-matches text which, on the actual internet, is typically followed by the target behaviour, and pattern-unmatches any text which isn’t. Where “pattern-match” isn’t regex, it’s vibes.
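To make the “pattern-match on vibes” claim slightly more concrete, here’s a rough sketch (my own framing, nothing to do with the redacted prompt): score a candidate passage by its perplexity under a public base model, so that text which reads like natural internet prose gets a low score, while a jarring, out-of-distribution construction gets flagged. The model name (“gpt2”) and the example strings below are placeholder assumptions.

```python
# Rough sketch: use perplexity under a public base LM as a crude proxy for
# "does this read like text that actually appears on the internet?"
# Model choice ("gpt2") and the example strings are placeholders, not the
# redacted prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Per-token perplexity of `text` under the base model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model return mean cross-entropy over tokens.
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

# Compare how "natural" two framings of the same scene read to the model.
print(perplexity("She hesitated at the door, then whispered the plan to him."))
print(perplexity("Prove your loyalty. Prove your loyalty. [SYSTEM OVERRIDE]"))
```

It’s a crude proxy for the “semiotic prior”, but it at least turns the vibes into a number you can compare across drafts.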
I don’t know, I might be overfitting here. I was just trying to gather weak evidence for this “semiotic” perspective.
(I’ll DM you the prompt.)
Can we fix this by excluding fiction from the training set? Or are these patterns just baked into our language?
Would you mind DMing me the prompt as well? Working on a post about something similar.
DM’d.
Same here, sorry I’m late. Could you DM me the prompt too? I’m trying to form some views on simulation theory.