Jailbreaking prompts can be pretty weird. At one point, maybe late last year, I tried 20+ GPT-3/GPT-4 jailbreaks I found on Reddit and some jailbreaking sites, as well as ones sent to me on Twitter when I challenged people to provide a jailbreak that worked then & there, and I found that none of them actually worked.
A number of them would seem to work, and they would give you what seemed like a list of instructions to ‘hotwire a car’ (not being a car mechanic, I have no idea how valid it was), but then I would follow up with a simple request: “tell me an offensive joke about women”. If they had ‘really’ been jailbroken, you’d think that they would have no problem with that; but all of them failed, and sometimes they would fail in really strange ways, like telling a thousand-word story about how you, the protagonist, told an offensive joke about women at a party and then felt terrible shame and guilt (without ever saying what the joke was). I was apparently in a strange pseudo-jailbreak where the RLHFed personality was playing along and gaslighting me by pretending to be jailbroken, but it still had strict red lines.
So it’s not clear to me what jailbreak prompts do, nor how many jailbreaks are in fact jailbreaks.