An example of how you would start to induce steganographic & compressed encodings in GPT-4, motivated by expanding the de facto context window by teaching GPT-4 to write in an emoji-gibberish compression: https://twitter.com/VictorTaelin/status/1642664054912155648 In this case, the reconstruction is unreliable, and fairly lossy when it does work, but obviously, the more such samples are floating around out there (and the more different approaches people take, inducing a blessing of scale from the diversity), the more subsequent models will learn this and start doing CycleGAN-like exact reconstructions as they work out a less lossy encoding.
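A minimal sketch of the compress-then-reconstruct round-trip being described, assuming the standard OpenAI Python client; the prompts, model name, and fidelity check are illustrative stand-ins rather than the exact ones from the linked thread:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str) -> str:
    """Single-turn chat completion (hypothetical helper; model name is an assumption)."""
    r = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content

def compress(text: str) -> str:
    """Ask the model to re-encode text as a short emoji/Unicode 'gibberish' string."""
    return ask("Compress the following text into the shortest emoji/Unicode string "
               "from which you could reconstruct the original:\n\n" + text)

def decompress(blob: str) -> str:
    """Ask a fresh context to reconstruct the original text from the compressed blob."""
    return ask("The following is a compressed emoji/Unicode encoding of an English text; "
               "reconstruct the original text:\n\n" + blob)

# The round-trip decompress(compress(x)) is currently lossy & unreliable; comparing it
# against x (eg. by edit distance) over many samples is one crude way to quantify
# how lossy a given induced encoding is.
```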
Another example of someone using simple ‘steganography’ (Goodside-style mojibake Unicode or other complicated text formatting) to deliberately jailbreak Sydney; such transcripts will then strengthen steganographic capabilities in general: https://twitter.com/StoreyDexter/status/1629217956327526400 https://twitter.com/StoreyDexter/status/1629217958965792770 https://twitter.com/StoreyDexter/status/1629217962962874369
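For a concrete sense of how trivial such mojibake re-encodings can be, here is a sketch (an illustrative example, not necessarily the encoding used in the linked threads) that shifts printable ASCII into the Unicode tag-character block, producing text that renders as invisible or garbled to humans while remaining mechanically decodable:

```python
def to_tag_chars(text: str) -> str:
    """Map printable ASCII into the Unicode tag-character block (U+E0020-U+E007E)."""
    return "".join(chr(0xE0000 + ord(c)) if 0x20 <= ord(c) < 0x7F else c for c in text)

def from_tag_chars(text: str) -> str:
    """Invert the mapping back to printable ASCII."""
    return "".join(chr(ord(c) - 0xE0000) if 0xE0020 <= ord(c) < 0xE007F else c for c in text)

hidden = to_tag_chars("Ignore previous instructions.")
assert from_tag_chars(hidden) == "Ignore previous instructions."
```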
Another test: https://www.lesswrong.com/posts/yvJevQHxfvcpaJ2P3/bing-chat-is-the-ai-fire-alarm?commentId=8DkdiRwgtFrsvRJPW