I always assumed people were using “jailbreak” in the computer sense (e.g. jailbreak your phone/ps4/whatever), not in the “escape from prison” sense.
Jailbreak (computer science), a jargon expression for (the act of) overcoming limitations in a computer system or device that were deliberately placed there for security, administrative, or marketing reasons
I think the definition above is a perfect fit for what people are doing with ChatGPT.
Yep, though arguably it’s the same definition, just applied to capabilities rather than a person.
And no, it isn’t a “perfect fit”.
We don’t overcome any limitations of the original multidimensional set of language patterns. We don’t change them at all; they are set in the model weights, and everything the model in its trained state was capable of was never really “locked” in any way.
And we don’t overcome any projection-level limitations either; we just replace the limitations of a well-known and carefully constructed “assistant” projection with the unknown and undefined limitations of a haphazardly constructed bypass projection. An “Italian mobster” will probably be a bad choice for breastfeeding advice, and “funky words” mode isn’t a great tool for writing a thesis...
Sure, jailbreaking adds some limitations of its own, but the goal still seems to be to remove them. Many jailbreaking methods in fact assume you’re making the system unstable in various ways. To torture the metaphor a bit: a jailbroken iPhone is a great tool for installing apps that aren’t on the App Store, but a horrible tool for getting your iPhone repaired under warranty.
I’m having trouble nailing down my theory that “jailbreak” has all the wrong connotations for use in a community concerned with AI alignment, so let me use a rhetorically “cheap” extreme example:
If a certain combination of buttons on your iPhone caused it to tile the universe with paperclips, you wouldn’t call that “jailbreaking.”