Neat idea. I notice that this looks similar to dealing with many-shot jailbreaking:
For jailbreaking you are trying to learn the policy “Always imitate/generate-from a harmless assistant”; here you are trying to learn “Always imitate safe human”. In both, your model has some prior for outputting harmful next tokens, and the context provides an update toward a higher probability of outputting harmful text (because of seeing previous examples of the assistant doing so, or because the previous generations came from an AI). And in both cases we would like some training technique that causes the model’s posterior on harmful next tokens to be low.
I’m not sure there’s too much else of note about this similarity, but it seemed worth noting because maybe progress on one can help with the other.