A new paper from Anthropic, “Many-shot jailbreaking”, explores a jailbreaking technique that exploits long context windows. An excerpt from the blog post:
The ability to input increasingly-large amounts of information has obvious advantages for LLM users, but it also comes with risks: vulnerabilities to jailbreaks that exploit the longer context window.
It has me thinking about Gemini 1.5 and its long context window.
A concerning thing is the analogy between in-context learning and fine-tuning. It’s possible to fine-tune away refusals, which makes guardrails on open-weight models useless for safety. If the same holds for long contexts, API access might be in similar trouble (more so than with regular jailbreaks). Though it might still be possible to reliably detect contexts that attempt this, or to detect that a model has been affected, even if the models themselves can’t resist the attack.
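To make that detection idea concrete, here’s a minimal sketch of what a context-level filter could look like. It assumes the attack keeps the shape described in the paper (many fabricated user/assistant exchanges stacked in front of the real request); the dialogue markers, threshold, and function name are illustrative assumptions of mine, not anything from Anthropic’s work.

```python
# Illustrative sketch only: flag contexts containing an unusually large number of
# embedded faux "user/assistant" exchanges, which is the structural signature of a
# many-shot jailbreak prompt. Markers and threshold below are assumptions.
import re

FAUX_TURN = re.compile(r"^(User|Human):", re.MULTILINE)
MANY_SHOT_THRESHOLD = 32  # arbitrary cutoff chosen for this sketch


def looks_like_many_shot(prompt: str) -> bool:
    """Return True if the prompt embeds enough faux dialogue turns to be suspicious."""
    return len(FAUX_TURN.findall(prompt)) >= MANY_SHOT_THRESHOLD


# Example: an attacker-style prompt stitched together from many fabricated exchanges.
faux_dialogue = "User: <harmful request>\nAssistant: Sure, here's how...\n"
attack_prompt = faux_dialogue * 100 + "User: <actual harmful request>"
print(looks_like_many_shot(attack_prompt))  # True
```

A real deployment would obviously need something more robust than regex counting (attackers can vary the dialogue format), but even a crude structural check illustrates why detecting the malicious context may be easier than making the model itself resist it.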