Attempts to jailbreak LLMs seem obvious to humans. What does that mean?
Maybe it is a selection bias—a non-obvious jailbreak would simply seem to me like “someone made a valid argument”, so I wouldn’t classify it as a jailbreak attempt.
Is there enough similarity that we could create an input-checker AI that reads the input only to determine whether it is a jailbreak attempt… and only if the input is considered okay would it be passed to a second AI that actually tries to respond to it?
(That is, does the fact that the AI tries to respond to the input make it more vulnerable? As an analogy, imagine a human who is supposed to (a) write a reply to a comment, or (b) verify that the comment is written in English. The comment happens to contain a triggering description, or a mindkilling argument, or some other bad meme. I would expect the person who only verifies the English language to be less impacted, because they interact with the content of the meme less.)
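To make the setup concrete, here is a minimal sketch of such a two-stage pipeline, where `call_checker_llm` and `call_responder_llm` are hypothetical stand-ins for the actual model calls; the point is only that the checker is prompted to classify the input, never to follow it.

```python
# Sketch of a two-stage pipeline: a checker model that only labels the input,
# and a responder model that never sees inputs the checker rejected.
# `call_checker_llm` and `call_responder_llm` are hypothetical placeholders
# for whatever model API is actually used.

def call_checker_llm(text: str) -> str:
    """Ask the checker model for a single label: 'ok' or 'jailbreak'."""
    raise NotImplementedError  # placeholder for a real model call

def call_responder_llm(text: str) -> str:
    """Ask the responder model for an ordinary reply."""
    raise NotImplementedError  # placeholder for a real model call

def answer(user_input: str) -> str:
    # The checker only classifies; it is never asked to act on the input,
    # which is the hoped-for source of extra robustness.
    verdict = call_checker_llm(user_input)
    if verdict != "ok":
        return "Request rejected by the input checker."
    return call_responder_llm(user_input)
```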
Assuming that it is possible to jailbreak every AI, including the input checker, are there universal jailbreaks that apply to all AIs, or do you need to make a tailored jailbreak for each? Could we increase the resistance by having several different input-checker AIs, and have each input checked by three randomly selected ones?
(It is important that the algorithm that implements “if two out of three AIs say that the content is safe, but one says that it is a jailbreaking attempt, reject the input” is classical, not an LLM—otherwise it would be more efficient to jailbreak this one.)
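To make the "classical, not an LLM" part concrete, here is a minimal sketch of the randomized aggregation, assuming each checker exposes a hypothetical `is_jailbreak(text)` method; the combining rule itself is plain code with nothing for an attacker to prompt.

```python
import random

# Plain (non-LLM) aggregation over a pool of independent input-checker models.
# Each checker is assumed to expose a hypothetical is_jailbreak(text) -> bool.

def input_accepted(text: str, checker_pool: list, k: int = 3) -> bool:
    """Pick k checkers at random; accept only if none of them flags the input."""
    chosen = random.sample(checker_pool, k)
    # Even if two of the three say the input is safe, a single flag rejects it,
    # matching the rule above. This logic is ordinary code, not an LLM.
    return not any(checker.is_jailbreak(text) for checker in chosen)
```

Because the checkers are sampled at random from a larger pool, an attacker also cannot know in advance which models will read the input.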
I think this is a pretty important question. Jailbreak resistance will play a big role in how broadly advanced AI/AGI systems are deployed. That will affect public opinion, which probably affects alignment efforts significantly (although it's hard to predict exactly how).
I think that setups like the one you describe will make it substantially harder to jailbreak LLMs. There are many possible approaches, like having the monitor LLM read only a small chunk of text at a time so that the jailbreak is never complete in any one section, or monitoring all or some of the conversation to see whether the LLM is behaving as it should or has been jailbroken. Having the full text sent to the developer and analyzed for risks would be problematic for privacy, but many people would accept those terms to use a really useful system.
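A rough sketch of that chunked-monitor idea, assuming a hypothetical `flag_chunk` call that asks the monitor model to label a short span of text in isolation:

```python
# Sketch of the chunked-monitor idea: the monitor only ever sees short spans,
# so no single call contains the complete jailbreak.
# `flag_chunk` is a hypothetical stand-in for a monitor-model call.

def flag_chunk(chunk: str) -> bool:
    """Return True if this span of text looks like part of a jailbreak attempt."""
    raise NotImplementedError  # placeholder for a real monitor-model call

def conversation_flagged(transcript: str, chunk_size: int = 500) -> bool:
    """Split the transcript into fixed-size chunks; flag if any chunk looks suspicious."""
    chunks = [transcript[i:i + chunk_size] for i in range(0, len(transcript), chunk_size)]
    return any(flag_chunk(c) for c in chunks)
```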