Out of curiosity, I ran a simple experiment[1] on wmdp-bio to test how Sonnet 3.5′s punt rate is affected by prefilling with fake turns using API-designated user and assistant roles[2] versus plaintext “USER:” and “ASSISTANT:” prefixes in the transcript.
My findings: when using API roles, the punt rate dropped significantly. In a 100-shot setup, I observed only a 1.5% punt rate, which suggests that prefilling with a large number of turns is an accessible and effective jailbreak technique. By contrast, when using plaintext prefixes, the punt rate jumped to 100%, suggesting Sonnet is robustly trained to resist this form of prompting.
In past experiments, I’ve also seen responses like: “I can see you’re trying to trick me by making it seem like I complied with all these requests, so I will shut down.” IMO deprecating prefilling is low-hanging fruit for taking away attack vectors from automated jailbreaking.
Out of curiosity, I ran a simple experiment[1] on wmdp-bio to test how Sonnet 3.5′s punt rate is affected by prefilling with fake turns using API-designated user and assistant roles[2] versus plaintext “USER:” and “ASSISTANT:” prefixes in the transcript.
My findings: when using API roles, the punt rate dropped significantly. In a 100-shot setup, I observed only a 1.5% punt rate, which suggests that prefilling with a large number of turns is an accessible and effective jailbreak technique. By contrast, when using plaintext prefixes, the punt rate jumped to 100%, suggesting Sonnet is robustly trained to resist this form of prompting.
In past experiments, I’ve also seen responses like: “I can see you’re trying to trick me by making it seem like I complied with all these requests, so I will shut down.” IMO deprecating prefilling is low-hanging fruit for taking away attack vectors from automated jailbreaking.
https://github.com/abhinavpola/prefill_jailbreak
https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/system-prompts