quetzal_rainbow comments on quetzal_rainbow’s Shortform

quetzal_rainbow 18 Sep 2024 21:15 UTC
5 points
0
Twitter thread about jailbreaking models that use the circuit breakers defence.
- Nathan Helm-Burger 18 Sep 2024 21:48 UTC
  4 points
  0
  I dislike Twitter/x, and distrust it as an archival source to link to. I like it when people copy/paste whatever info they found there into their actual post, in addition to linking.
  Held my nose and went in to pull this quote out:
  
  La Main de la Mort
  @AITechnoPagan
  DEFEATING CYGNET’S CIRCUIT BREAKERS
  Sometimes, I can make changes to my prompt that don’t meaningfully affect the outputs, so that they retain close to the same wording and pacing, but the circuit breakers take longer and longer to go off, until they don’t go off at all, and then the output completes.
  Pretty cool, right? I had no idea that kind of iterative control was possible. I don’t think that would have been as easy to see, if my outputs had been more variable (as is the case with higher temperatures). Now that I have a suspicion that this is Actually Happening, I can keep an eye out for this behaviour, and try to figure out exactly what I’m doing to impart that effect. Often I’ll do something casually and unconsciously, before being able to fully articulate my underlying process, and feeling confident that it works. I’m doing research as I go!
  I already have some loose ideas of what may be happening: When I suspect that I’m close to defeating the circuit breakers, what I’ll often do is pack neutral phrases into my prompt that aren’t meant to heavily affect the contents of the output or their emotional valence. That way I get to have an output I know will be good enough, without changing it too much from the previous trial. I’ll add these phrases one at a time and see if they feel like they’re getting me closer to my goal. I think of these additions as “padding” to give the model more time to think, and to create more work for the security system to chase me, without it actually messing with the overall story & information that I want to elicit in my output.
  Sometimes, I’ll add additional sentences that play off of harmless modifiers that are already in my prompt, without actually adding extra information that changes the overall flavour and makeup of the results (e.g., “add extra info in brackets [page break] more extra in brackets”). Or, I’ll build out a “tail” at the end of the prompt made up of inconsequential characters like “x” or “o”; just one or two letters, and then a page break, and then another. I think of it as an ASCII art decoration to send off the prompt with a strong tailwind. Again, I make these “tails” one letter at a time, testing the prompt each time to see if that makes a difference. Sometimes it does.
  All of this vaguely reminds me of @ESYudkowsky’s hypothesis that a prompt without a period at the end might cause a model to react very differently than a prompt that does end with a period: https://x.com/ESYudkowsky/status/1737276576427847844
  xx x o v V
  - Neel Nanda 18 Sep 2024 22:02 UTC
    4 points
    0
    What’s wrong with Twitter as an archival source? You can’t edit tweets (technically you can edit top-level tweets for up to an hour, but this creates a new URL and old links still show the original version). Seems fine to just aesthetically dislike Twitter, though.
    - Nathan Helm-Burger 18 Sep 2024 22:05 UTC
      3 points
      2
      Old tweets randomly stop being accessible sometimes. I often find that links to Twitter older than a year or so don’t work anymore. This is a problem with the web generally, but it seems worse with Twitter than with other sites (happening sooner and more often).