I’d be really interested to see how the harmfulness feature relates to multi-turn jailbreaks! We recently explored splitting a cipher attack into a multi-turn jailbreak (where instead of passing in the word mappings + the ciphered harmful prompt all at once, you pass in the word mappings, let the model respond, and then pass in the harmful prompt).
I’d expect to see something like when you “spread out the harm” enough, such that no one prompt contains any blaring red flags, the harmfulness feature never reaches the critical threshold, or something?
Scale recently published some great multi-turn work too!