gwern comments on By Default, GPTs Think In Plain Sight

gwern 19 Feb 2023 16:38 UTC
LW: 4 AF: 3
2
AF
Some speculation that the recent Bing Sydney incident may already involve some steganography/non-robust features inside the ‘Sydney’ persona-inducing retrieval examples: https://twitter.com/lumpenspace/status/1626031032376954880 https://markdownpastebin.com/?id=3cf3e29dca254c2c80b0da312691702a
- gwern 27 Feb 2023 17:17 UTC
  LW: 4 AF: 2
  0
  AF Parent
  Another example of someone using simple ‘steganography’ (Goodside-style mojibake Unicode or other complicated text formatting) to deliberately jailbreak Sydney; the resulting transcripts will then strengthen steganographic capabilities in general: https://twitter.com/StoreyDexter/status/1629217956327526400 https://twitter.com/StoreyDexter/status/1629217958965792770 https://twitter.com/StoreyDexter/status/1629217962962874369
  - gwern 28 Feb 2023 16:16 UTC
    LW: 3 AF: 2
    0
    AF Parent
    Another test: https://www.lesswrong.com/posts/yvJevQHxfvcpaJ2P3/bing-chat-is-the-ai-fire-alarm?commentId=8DkdiRwgtFrsvRJPW
    - gwern 3 Apr 2023 15:07 UTC
      LW: 5 AF: 3
      0
      AF Parent
      An example of how you would start to induce steganographic & compressed encodings in GPT-4, motivated by the goal of expanding the de facto context window: teaching GPT-4 to write in an emoji-gibberish compression: https://twitter.com/VictorTaelin/status/1642664054912155648 In this case, the reconstruction is unreliable, and fairly lossy even when it works, but obviously, the more samples are floating around out there (and the more different approaches people take, inducing a blessing of scale from the diversity), the more subsequent models will learn this and start doing CycleGAN-like exact reconstructions as they work out a less lossy encoding.
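
To make "lossy" vs. "exact" reconstruction concrete, here is a minimal sketch of how one might score a compress-then-reconstruct round trip. Everything in it is an assumption for illustration: `toy_encode` and `toy_decode` are hypothetical stand-ins (a prefix-truncation codec, not the emoji prompt from the linked tweet and not anything GPT-4 does), and the character-level similarity from Python's `difflib` is just one possible fidelity measure.

```python
import difflib


def toy_encode(text: str) -> str:
    """Hypothetical lossy 'compression': keep only the first 3 letters of each word."""
    return " ".join(word[:3] for word in text.split())


def toy_decode(code: str, vocabulary: list[str]) -> str:
    """Expand each stub to the first vocabulary word sharing its prefix.

    If no vocabulary word matches, the stub is emitted unchanged.
    """
    words = []
    for stub in code.split():
        match = next((w for w in vocabulary if w.startswith(stub)), stub)
        words.append(match)
    return " ".join(words)


def round_trip_fidelity(original: str, reconstructed: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 only for an exact reconstruction."""
    return difflib.SequenceMatcher(None, original, reconstructed).ratio()


text = "models learn compressed encodings"

# A decoder whose 'prior' (vocabulary) matches the source recovers it exactly.
good_vocab = ["models", "learn", "compressed", "encodings"]
exact = toy_decode(toy_encode(text), good_vocab)

# A decoder with a mismatched prior produces a plausible but wrong expansion.
poor_vocab = ["modest", "learn", "compressed", "encodings"]
lossy = toy_decode(toy_encode(text), poor_vocab)

exact_score = round_trip_fidelity(text, exact)
lossy_score = round_trip_fidelity(text, lossy)
```

The only point of the toy is that reconstruction quality depends on how well the decoder's prior matches the encoder's conventions, which is the sense in which more shared samples could move a model from lossy toward exact round trips.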