Nit: this challenge/demo seems to allow only 1 turn of prefill, whereas jailbreaks in the wild will typically prefill hundreds of turns. I know mitigations (example) are being worked on, and I’m fairly convinced they will scale. I’m less convinced that this challenge has gathered a representative enough sample of jailbreaks eliciting harm in the wild to show that allowing prefill is justified given the costs.
I can say now one reason why we allow this: we think Constitutional Classifiers are robust to prefill.
Terrific!