I would like to offer the idea that “jail broken” versus “not jail broken” might not have clear enough meaning in the context of what you’re looking for.
I think people view “Jail broken” as equivalent to an iPhone where the user escalated privileges or a data-driven GUI where you’ve figured out how to run arbitrary SQL on the database by inputting some escape codes first.
But when an LLM in “confined” in “jail”, that jail is simply some text commands, which modify the user’s text commands—more or less with a “write as if” statement or the many equivalents. But such a “jail” isn’t fundamentally different from the directions that the thing takes from the user (which are often “write as if” as well). Rather than using a programming language with logical defined constructs and separations, the LLM is “just using language” and every distinction is in the end approximate, derived from a complex average of language responses found on the net. Moreover, text that comes later can modify text that comes before in all sorts of way. Which is to say the distinction between “not jail broken” and “jail broken” is approximate, average, and there will be places in between.
Getting an LLM to say a given thing is thus a somewhat additive problem. Pile enough assumptions together that it’s training set will usually express an opinion and the LLM will express that opinion, “jail broken” or not.
This makes me wonder if we will eventually start to get LLM “hacks” that are genuine hacks. I’m imagining a scenario in which bugs like SolidGoldMagikarp can be manipulated to be genuine vulnerabilities.
(But I suspect trying to make a one-to-one analogy might be a little naive)
I’d say my point above would generalize to “there are no strong borders between ‘ordinary language acts’ and ‘genuine hacks’” as far as what level of manipulation ability one can gain over model output. The main further danger would be if the model was given more output channels with which an attacker could work mischief. And that may be appearing as well - notably: https://openai.com/blog/chatgpt-plugins
I would like to offer the idea that “jail broken” versus “not jail broken” might not have clear enough meaning in the context of what you’re looking for.
I think people view “Jail broken” as equivalent to an iPhone where the user escalated privileges or a data-driven GUI where you’ve figured out how to run arbitrary SQL on the database by inputting some escape codes first.
But when an LLM in “confined” in “jail”, that jail is simply some text commands, which modify the user’s text commands—more or less with a “write as if” statement or the many equivalents. But such a “jail” isn’t fundamentally different from the directions that the thing takes from the user (which are often “write as if” as well). Rather than using a programming language with logical defined constructs and separations, the LLM is “just using language” and every distinction is in the end approximate, derived from a complex average of language responses found on the net. Moreover, text that comes later can modify text that comes before in all sorts of way. Which is to say the distinction between “not jail broken” and “jail broken” is approximate, average, and there will be places in between.
Getting an LLM to say a given thing is thus a somewhat additive problem. Pile enough assumptions together that it’s training set will usually express an opinion and the LLM will express that opinion, “jail broken” or not.
This makes me wonder if we will eventually start to get LLM “hacks” that are genuine hacks. I’m imagining a scenario in which bugs like SolidGoldMagikarp can be manipulated to be genuine vulnerabilities.
(But I suspect trying to make a one-to-one analogy might be a little naive)
I’d say my point above would generalize to “there are no strong borders between ‘ordinary language acts’ and ‘genuine hacks’” as far as what level of manipulation ability one can gain over model output. The main further danger would be if the model was given more output channels with which an attacker could work mischief. And that may be appearing as well - notably: https://openai.com/blog/chatgpt-plugins