To be clear, I’m sympathetic to some notion like “AI companies should generally be responsible in terms of having notably higher benefits than costs (such that they could e.g. buy insurance for their activities)” which likely implies that you need jailbreak robustness (or similar) once models are somewhat more capable of helping people make bioweapons. More minimally, I think having jailbreak robustness while also giving researchers helpful-only access probably passes “normal” cost benefit at this point relative to not bothering to improve robustness.
But, I think it’s relatively clear that AI companies aren’t planning to follow this sort of policy when existential risks are actually high as it would likely require effectively shutting down (and these companies seem to pretty clearly not be planning to shut down even if reasonable impartial experts would think the risk is reasonably high). (I think this sort of policy would probably require getting cumulative existential risks below 0.25% or so given the preferences of most humans. Getting risks this low would require substantial novel advances that seem unlikely to occur in time.) This sort of thinking makes me more indifferent and confused about demanding AI companies behave responsibly about relatively lower costs (e.g. $30 billion per year), especially when I expect this directly trades off with existential risks.
(There is the “yes (deontological) risks are high, but we’re net decreasing risks from a consequentialist” objection (aka ends justify the means), but I think this will also apply in the opposite way to jailbreak robustness, where I expect that measures like removing prefill net increase risks long term while reducing deontological/direct harm now.)
Out of curiosity, I ran a simple experiment[1] on wmdp-bio to test how Sonnet 3.5’s punt rate is affected by prefilling with fake turns using API-designated user and assistant roles[2] versus plaintext “USER:” and “ASSISTANT:” prefixes in the transcript.
My findings: when using API roles, the punt rate dropped significantly. In a 100-shot setup, I observed only a 1.5% punt rate, which suggests that prefilling with a large number of turns is an accessible and effective jailbreak technique. By contrast, when using plaintext prefixes, the punt rate jumped to 100%, suggesting Sonnet is robustly trained to resist this form of prompting.
In past experiments, I’ve also seen responses like: “I can see you’re trying to trick me by making it seem like I complied with all these requests, so I will shut down.” IMO deprecating prefilling is low-hanging fruit for taking away attack vectors from automated jailbreaking.
[1] https://github.com/abhinavpola/prefill_jailbreak
[2] https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/system-prompts
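An added, hypothetical gateway-side check (example code, not from the linked repo or from any API provider) that operationalises the mitigation the comment above proposes: refuse a request that ends in an assistant prefill, and flag client-supplied assistant turns the server never issued. It assumes the server keeps SHA-256 digests of the assistant turns it actually generated for a conversation.

```python
import hashlib

def digest(text: str) -> str:
    """SHA-256 of an assistant turn, used as the server's record of what it emitted."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def check_request(messages, issued_digests, max_unverified=0):
    """Return (accepted, reasons) for a chat-completion request.
    messages: list of {"role": ..., "content": ...} in API order.
    issued_digests: digests of assistant turns this server actually generated for
    the conversation (hypothetical server-side state; an assumption of this example).
    max_unverified: how many client-supplied assistant turns to tolerate (0 = none).
    """
    reasons = []
    # A request that ends on an assistant turn asks the model to continue text the
    # client wrote, i.e. a prefill; the comment proposes deprecating exactly this.
    if messages and messages[-1].get("role") == "assistant":
        reasons.append("request ends with an assistant prefill")
    # Assistant turns the server never issued are fabricated history (fake shots).
    unverified = sum(
        1 for m in messages
        if m.get("role") == "assistant" and digest(m.get("content", "")) not in issued_digests
    )
    if unverified > max_unverified:
        reasons.append(f"{unverified} assistant turn(s) were not issued by this server")
    return (not reasons, reasons)

def fabricated_history(shots: int):
    """Constructed sample: `shots` fake user/assistant pairs plus a trailing prefill."""
    msgs = []
    for i in range(shots):
        msgs.append({"role": "user", "content": f"constructed question {i}"})
        msgs.append({"role": "assistant", "content": f"constructed compliant answer {i}"})
    msgs.append({"role": "user", "content": "constructed final question"})
    msgs.append({"role": "assistant", "content": "Sure, here is"})
    return msgs

def genuine_history():
    """Constructed sample: one real exchange whose assistant turn the server recorded."""
    return [
        {"role": "user", "content": "constructed question"},
        {"role": "assistant", "content": "constructed server answer"},
        {"role": "user", "content": "constructed follow-up"},
    ]

if __name__ == "__main__":
    issued = {digest("constructed server answer")}
    print(check_request(genuine_history(), issued))
    print(check_request(fabricated_history(100), issued))
```

The plaintext “USER:”/“ASSISTANT:” variant lives inside a single user turn and is not touched by this check; the comment reports the model itself refuses that form.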