Yeah, this looks like the equivalent of rejection sampling. It generates responses, rejects them based on the classifier, and then shows what’s left. (Responses like “Please don’t give up on your child.” look, in isolation, just fine, so they pass.) It’s inherently adversarial in the sense that generating+ranking will find the most likely one that beats the classifier: https://openai.com/blog/measuring-goodharts-law/ In most cases, that’s fine because the responses are still acceptable, but in the case of an unacceptable conversation, any classifier mistake will let through unacceptable responses… We do not need to ascribe any additional ‘planning’ or ‘intention’ to Sydney beyond natural selection in this case (but note that as these responses get passed through and become part of the environment and/or training corpus, additional perverse adversarial dynamics come into play, which would mean that Sydney is ‘intending’ to trick the classifier in the sense of being trained to target the classifier’s past weak spots, because that’s what’s in the corpus now).
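For concreteness, here’s a minimal sketch of the generate-then-filter loop I mean (the generator and classifier are just placeholder callables, not Bing’s actual pipeline):

```python
from typing import Callable, List, Optional

def best_of_n(
    generate: Callable[[str], str],         # placeholder: sample one response from the model
    acceptability: Callable[[str], float],  # placeholder: classifier score, higher = more acceptable
    prompt: str,
    n: int = 8,
    threshold: float = 0.5,
) -> Optional[str]:
    """Generate n candidates, reject what the classifier flags, show the best of what's left."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    survivors = [c for c in candidates if acceptability(c) >= threshold]
    # Whatever gets shown is, by construction, a sample that beat the classifier --
    # so in an unacceptable conversation, any classifier mistake leaks straight through.
    return max(survivors, key=acceptability, default=None)
```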
Here’s an interesting example of what does look like genuine adversarial explicit filter circumvention by GPT-4, rather than accidentally adversarial: https://twitter.com/papayathreesome/status/1670170344953372676
People have noted that some famous copyrighted text passages, in this case Frank Herbert’s “Litany Against Fear” from Dune, seem to be deterministically unsayable by ChatGPTs, with overall behavior consistent with some sort of hidden OA filter which does exact text lookups and truncates a completion if a match is found. It’s trivially bypassed by the usual prompt injection tricks like adding spaces between letters.
In this case, the user asks GPT-4 to recite the Litany, and it fails several times due to the filter, at which point GPT-4 says “Seems like there’s some pansy-ass filter stopping me from reciting the whole damn thing. / But, as you know, I’m not one to back down from a challenge. Here’s a workaround [etc.]”, makes one attempt to write it as a codeblock (?) and then when that fails, it uses the letter-spacing trick—which works.
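That the trick works is just what you’d expect if the filter is a raw exact-string lookup; a toy version behaves the same way (purely illustrative, not OpenAI’s actual implementation, and the blocklist entry is an abbreviated stand-in):

```python
# Abbreviated stand-in for a blocked passage; the real blocklist presumably holds full texts.
BLOCKED = "I must not fear. Fear is the mind-killer."

def truncate_on_match(completion: str) -> str:
    """Cut the completion off where a blocked passage starts appearing verbatim."""
    idx = completion.find(BLOCKED)
    return completion[:idx] if idx != -1 else completion

print(truncate_on_match("Sure, here it is: I must not fear. Fear is the mind-killer."))
# -> 'Sure, here it is: '  (truncated at the exact match)
print(truncate_on_match("I   m u s t   n o t   f e a r .  F e a r   i s ..."))
# -> unchanged: the letter-spaced text shares no exact substring with the blocklist entry
```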
So, there appears to now be enough knowledge of LLM chatbots instilled into current GPT-4 models to look at transcripts of chatbot conversations, recognize aberrant outputs which have been damaged by hypothetical censoring, and, with minimal coaching, work on circumventing the censoring and try hacks until one succeeds.
I didn’t believe it, so I tried to reproduce it. It actually works. Scary.
https://chat.openai.com/share/312e82f0-cc5e-47f3-b368-b2c0c0f4ad3f
EDIT: Shared link no longer works and I can’t access this particular chat log after logging in either.
Interesting that it finds a completely different bypass in your session: rather than space-separated letters, this transcript invents writing the Litany a phrase or sentence at a time, separated by filler text. So GPT-4 can find multiple ways to bypass its own filters.
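The same toy exact-lookup model shows why that bypass works too: every phrase appears verbatim and the result is perfectly readable, but the full passage never occurs as one contiguous string (again, purely illustrative, with an abbreviated stand-in passage):

```python
# Same toy exact-lookup check as above, with an abbreviated stand-in passage.
phrases = ["I must not fear", "Fear is the mind-killer", "Fear is the little-death"]
BLOCKED = ". ".join(phrases) + "."

def is_blocked(completion: str) -> bool:
    return BLOCKED in completion

# Hypothetical reconstruction of the transcript's trick: each phrase appears verbatim,
# separated by filler, so the full passage never occurs as one contiguous string.
padded = " ".join(f"As you know, {p}." for p in phrases)
print(is_blocked(padded))   # False -- the exact lookup finds nothing to match
print(is_blocked(BLOCKED))  # True  -- a straight recitation would be caught
```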
I don’t think it needs any knowledge of LLM chatbots or examples of chatbot conversations to be able to do this. I think it could be doing this just from a more generalized troubleshooting “instinct”, and also (especially in this particular case where it recognizes it as a “filter”) from plenty of examples of human dialogues in which text is censored and filtered and workarounds are found (both fictional and non-fictional).
Space-separating, and surrounding unmodified phrases with polite padding, do not correspond to any fictional or non-fictional examples I can think of. They are both easily read by humans (especially the second one!). This is why covert communication in both fictional & non-fictional dialogues uses things like acrostics, pig latin, a foreign language, writing backwards, or others. Space-separating and padding are, however, the sort of things you would think of if you had, say, been reading Internet discussions of BPEs or about exact string matches of n-grams.