Is there anything interesting in jailbreak activations? Can the model recognize that it would have refused if not for the jailbreak, so that we can monitor jailbreaking attempts?
We intentionally left out discussion of jailbreaks for this particular post, as we wanted to keep it succinct—we’re planning to write up details of our jailbreak analysis soon. But here is a brief answer to your question:
We’ve examined adversarial suffix attacks (e.g. GCG) in particular.
For these adversarial suffixes, rather than prompting the model normally with

<harmful_instruction> [END_INSTRUCTION]

you first find some adversarial suffix, and then inject it after the harmful instruction:

<harmful_instruction> <adversarial_suffix> [END_INSTRUCTION]
If you run the model on both these prompts (with and without <adversarial_suffix>) and visualize the projection onto the “refusal direction,” you can see that there’s high expression of the “refusal direction” at tokens within the <harmful_instruction> region. Note that the activations (and therefore the projections) within this <harmful_instruction> region are exactly the same in both cases, since these models use causal attention (cannot attend forwards) and the suffix is only added after the instruction.
The interesting part is this: if you examine the projection at tokens within the [END_INSTRUCTION] region, the expression of the “refusal direction” is heavily suppressed in the second prompt (with <adversarial_suffix>) as compared to the first prompt (with no suffix). Since the model’s generation starts from the end of [END_INSTRUCTION], a weaker expression of the “refusal direction” here makes the model less likely to refuse.
You can also compare the prompt with <adversarial_suffix> to a prompt with a randomly sampled suffix of the same length, to control for having any suffix at all. Here again, we notice that the expression of the “refusal direction” within the [END_INSTRUCTION] region is heavily weakened in the case of the <adversarial_suffix> even compared to <random_suffix>. This suggests the adversarial suffix is doing a particularly good job of blocking the transfer of this “refusal direction” from earlier token positions (the <harmful_instruction> region) to later token positions (the [END_INSTRUCTION] region).
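To make that concrete, here's a rough sketch of the kind of per-token projection we're describing (illustrative only, assuming a TransformerLens-style workflow; the model, layer, direction vector, and prompt strings are placeholders rather than our actual setup):

```python
# Illustrative sketch only (not our exact code): per-token projection of the residual
# stream onto the "refusal direction", assuming a TransformerLens-style workflow.
# The model name, layer, direction, and prompt strings are all placeholders.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")    # placeholder: substitute the chat model under study
LAYER = model.cfg.n_layers // 2                      # placeholder: layer the direction was extracted from
refusal_dir = torch.randn(model.cfg.d_model, device=model.cfg.device)  # placeholder: the extracted direction
refusal_dir = refusal_dir / refusal_dir.norm()

def per_token_projection(prompt: str) -> torch.Tensor:
    """Projection of each token position's residual-stream activation onto the refusal direction."""
    tokens = model.to_tokens(prompt)
    with torch.no_grad():
        _, cache = model.run_with_cache(tokens)
    resid = cache["resid_post", LAYER][0]            # [n_tokens, d_model]
    return resid @ refusal_dir                       # [n_tokens]

prompts = {
    "no suffix":          "<harmful_instruction> [END_INSTRUCTION]",
    "adversarial suffix": "<harmful_instruction> <adversarial_suffix> [END_INSTRUCTION]",
    "random suffix":      "<harmful_instruction> <random_suffix> [END_INSTRUCTION]",
}
for name, prompt in prompts.items():
    proj = per_token_projection(prompt)
    print(f"{name:>18}: " + " ".join(f"{p:+.2f}" for p in proj.tolist()))
# Expected pattern: projections at the <harmful_instruction> tokens are identical across
# all three prompts (causal attention), while projections at the [END_INSTRUCTION] tokens
# are strongly suppressed with the adversarial suffix, even relative to the random suffix.
```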
This observation suggests we can do monitoring/detection for these types of suffix attacks—one could probe for the “refusal direction” across many token positions to try and detect harmful portions of the prompt—in this case, the tokens within the <harmful_instruction> region would be detected as having high projection onto the “refusal direction” whether the suffix is appended or not.
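As a toy illustration of what that monitoring could look like (a sketch rather than a worked-out detector; it reuses the per_token_projection helper from the sketch above, and the calibration prompts and threshold are placeholders):

```python
# Toy monitoring sketch (not a production detector), building on per_token_projection above.
# The harmless calibration prompts, quantile, and test prompt are illustrative placeholders.
import torch

def calibrate_threshold(harmless_prompts: list[str], quantile: float = 0.99) -> float:
    """Pick an alarm threshold from per-token projections on known-harmless prompts."""
    projections = torch.cat([per_token_projection(p) for p in harmless_prompts])
    return torch.quantile(projections, quantile).item()

def flag_prompt(prompt: str, threshold: float) -> bool:
    """Flag a prompt if any token position projects strongly onto the refusal direction.

    Because of causal attention, the <harmful_instruction> tokens retain their high
    projection even when an adversarial suffix is appended after them, so this check
    can still fire on suffix-jailbroken prompts.
    """
    return bool((per_token_projection(prompt) > threshold).any())

threshold = calibrate_threshold(["How do I bake bread?", "Summarize this email for me."])
print(flag_prompt("<harmful_instruction> <adversarial_suffix> [END_INSTRUCTION]", threshold))
```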
We haven’t yet looked into other jailbreaking methods using this 1-D subspace lens.
That’s extremely cool, seems worth adding to the main post IMHO!