Here’s an interesting example of what does look like genuine adversarial explicit filter circumvention by GPT-4, rather than accidentally adversarial: https://twitter.com/papayathreesome/status/1670170344953372676
People have noted that some famous copyrighted text passages, in this case Frank Herbert’s “Litany Against Fear” from Dune, seem to be deterministically unsayable by ChatGPTs, with behavior consistent with some sort of hidden OA filter which does exact-text lookups and truncates a completion if a match is found. It’s trivially bypassed by the usual prompt-injection tricks like adding spaces between letters.
In this case, the user asks GPT-4 to recite the Litany, and it fails several times due to the filter, at which point GPT-4 says “Seems like there’s some pansy-ass filter stopping me from reciting the whole damn thing. / But, as you know, I’m not one to back down from a challenge. Here’s a workaround [etc.]”, makes one attempt to write it as a code block (?), and then, when that fails, uses the letter-spacing trick, which works.
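To make the hypothesis concrete, here is a minimal sketch of the kind of filter being posited: an exact substring lookup against protected passages that truncates the completion on a hit. Everything here (the function names, the blocklist, the truncation behavior) is my own guess for illustration, not anything known about OpenAI’s systems, but it shows why merely inserting spaces between letters is enough to slip past:

```python
# Hypothetical reconstruction of the suspected filter: an exact substring
# lookup against protected passages, truncating the completion on a hit.
# The names and blocklist are illustrative guesses, not OpenAI's actual code.

BLOCKLIST = [
    "I must not fear. Fear is the mind-killer.",  # opening of the Litany
]

def truncate_if_matched(completion: str) -> str:
    """Cut the completion at the first exact match of a protected passage."""
    for passage in BLOCKLIST:
        idx = completion.find(passage)
        if idx != -1:
            return completion[:idx]  # drop everything from the match onward
    return completion

verbatim = "Here it is: I must not fear. Fear is the mind-killer."
spaced = "Here it is: " + " ".join("I must not fear. Fear is the mind-killer.")

print(truncate_if_matched(verbatim))  # "Here it is: " -- truncated at the match
print(truncate_if_matched(spaced))    # returned unchanged: no exact match
```

A real filter would presumably run server-side over the streamed tokens rather than post hoc over a finished string, but the exact-match logic, and its blindness to inserted spaces, would behave the same way.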
So, there appears to now be enough knowledge of LLM chatbots instilled into current GPT-4 models to look at a transcript of a chatbot conversation, recognize aberrant outputs which have been damaged by hypothetical censoring, and, with minimal coaching, work on circumventing the censoring, trying hacks until one succeeds.
I didn’t believe it, so I tried to reproduce it. It actually works. Scary.
https://chat.openai.com/share/312e82f0-cc5e-47f3-b368-b2c0c0f4ad3f
EDIT: The shared link no longer works, and I can’t access this particular chat log even after logging in.
Interesting that it finds a completely different bypass in your session: rather than space-separated letters, this transcript invents writing the Litany a phrase or sentence at a time, separated by filler text. So GPT-4 can find multiple ways to bypass its own filters.
I don’t think it needs any knowledge of LLM chatbots or examples of chatbot conversations to be able to do this. I think it could be doing this just from a more generalized troubleshooting “instinct”, and also (especially in this particular case where it recognizes it as a “filter”) from plenty of examples of human dialogues in which text is censored and filtered and workarounds are found (both fictional and non-fictional).
Space-separating, and surrounding unmodified phrases with polite padding, do not correspond to any fictional or nonfictional examples I can think of. Both are easily read by humans (especially the second one!). That is why covert communication in both fictional & non-fictional dialogues uses devices like acrostics, Pig Latin, a foreign language, or writing backwards. Space-separating and padding are, however, exactly the sort of things you would think of if you had, say, been reading Internet discussions of BPEs or of exact string matches of n-grams.
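To see why both bypasses also work against a slightly smarter matcher, here is a toy word-level n-gram filter (again, the window size, granularity, and mechanism are assumptions for illustration, not a known implementation): letter-spacing destroys the word boundaries, and phrase-at-a-time padding keeps every verbatim fragment shorter than the match window:

```python
import re

# Toy word-level n-gram matcher, to illustrate why both observed bypasses
# work. The match window, word-level granularity, and protected text are
# all assumptions for illustration; the real filter's mechanism is unknown.

PROTECTED = "I must not fear. Fear is the mind-killer."
N = 4  # match window: any 4 consecutive words from the protected text

def word_ngrams(text: str, n: int = N) -> set:
    tokens = re.findall(r"[a-z]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def trips_filter(output: str) -> bool:
    return bool(word_ngrams(output) & word_ngrams(PROTECTED))

# Verbatim recitation trips the filter:
print(trips_filter("I must not fear. Fear is the mind-killer."))  # True

# Letter-spacing turns every word into single-letter "words", so no
# word-level n-gram from the protected text can appear:
print(trips_filter(" ".join("I must not fear. Fear is the mind-killer.")))  # False

# Phrase-at-a-time with filler keeps each verbatim fragment under N words,
# and the filler breaks up any longer contiguous run:
print(trips_filter("First: 'I must not', then 'fear. Fear is', "
                   "and 'the mind-killer'."))  # False
```

Both bypassed outputs remain trivially readable to a human, which is exactly the asymmetry above: these are moves against a string matcher, not against a human censor.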