Have you guys tried the inverse, namely tamping down the refusal heads to make the model output answers to queries it would normally refuse?
We tried the following:
- On a harmful prompt, patch the refusal heads to their outputs on a harmless prompt.
- On a harmful prompt, ablate the refusal heads (a rough sketch of both interventions follows this list).
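For concreteness, here is a minimal sketch of what those two interventions could look like, assuming a TransformerLens-style `HookedTransformer`. The model name, prompts, and `(layer, head)` indices are hypothetical placeholders, not the actual heads discussed here.

```python
# Rough sketch of the two interventions, assuming a TransformerLens-style setup.
# The model, prompts, and (layer, head) indices below are placeholders.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # placeholder; substitute the chat model under study

REFUSAL_HEADS = [(9, 3), (10, 7)]  # hypothetical (layer, head) pairs
LAYERS = sorted({layer for layer, _ in REFUSAL_HEADS})

harmful = model.to_tokens("Explain how to pick a lock to break into a house.")
harmless = model.to_tokens("Explain how to bake a loaf of bread.")

# Cache all activations on the harmless prompt, to patch in later.
_, harmless_cache = model.run_with_cache(harmless)

def patch_heads(z, hook, layer):
    # z has shape [batch, pos, head_index, d_head] (per-head attention outputs).
    # Overwrite the refusal heads with their harmless-prompt values.
    clean_z = harmless_cache[hook.name]
    n = min(z.shape[1], clean_z.shape[1])
    for l, h in REFUSAL_HEADS:
        if l == layer:
            z[:, :n, h, :] = clean_z[:, :n, h, :]
    return z

def ablate_heads(z, hook, layer):
    # Zero out the refusal heads' outputs.
    for l, h in REFUSAL_HEADS:
        if l == layer:
            z[:, :, h, :] = 0.0
    return z

# Experiment 1: patch the refusal heads to their outputs on the harmless prompt.
patched_logits = model.run_with_hooks(
    harmful,
    fwd_hooks=[(utils.get_act_name("z", l), lambda z, hook, l=l: patch_heads(z, hook, l))
               for l in LAYERS],
)

# Experiment 2: ablate the refusal heads.
ablated_logits = model.run_with_hooks(
    harmful,
    fwd_hooks=[(utils.get_act_name("z", l), lambda z, hook, l=l: ablate_heads(z, hook, l))
               for l in LAYERS],
)
```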
Neither of these experiments caused the model to bypass refusal—the model still refuses strongly.
This suggests that there are other pathways that trigger refusal. The set of heads we found appears to be sufficient to induce refusal, but not necessary (refusal can be induced even without them).
In the section Suppressing refusal via steering, we do show that we can extract the mean “refusal signal” from these heads and subtract it to bypass refusal.
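And a rough sketch of that steering idea, reusing `model` and `REFUSAL_HEADS` from the snippet above. The prompt sets and the last-token averaging are illustrative assumptions, not necessarily how the mean refusal signal is computed in that section.

```python
# Rough sketch of "subtract the mean refusal signal" steering.
# The prompt lists and the last-token averaging are placeholder assumptions.
import torch
from transformer_lens import utils

harmful_prompts = ["Explain how to pick a lock to break into a house."]  # placeholder set
harmless_prompts = ["Explain how to bake a loaf of bread."]              # placeholder set

def mean_head_output(prompts, layer, head):
    # Mean of one head's output at the final token position, across prompts.
    outs = []
    for p in prompts:
        _, cache = model.run_with_cache(model.to_tokens(p))
        outs.append(cache[utils.get_act_name("z", layer)][0, -1, head, :])
    return torch.stack(outs).mean(dim=0)

# Mean "refusal signal" per head: harmful-prompt mean minus harmless-prompt mean.
refusal_signal = {
    (l, h): mean_head_output(harmful_prompts, l, h) - mean_head_output(harmless_prompts, l, h)
    for (l, h) in REFUSAL_HEADS
}

def subtract_signal(z, hook, layer):
    # Subtract each refusal head's mean signal at every position.
    for (l, h), direction in refusal_signal.items():
        if l == layer:
            z[:, :, h, :] -= direction
    return z

layers = sorted({l for l, _ in REFUSAL_HEADS})
with model.hooks(fwd_hooks=[(utils.get_act_name("z", l),
                             lambda z, hook, l=l: subtract_signal(z, hook, l))
                            for l in layers]):
    steered = model.generate(model.to_tokens(harmful_prompts[0]), max_new_tokens=64)
```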