Andy Arditi comments on Refusal mechanisms: initial experiments with Llama-2-7b-chat

Andy Arditi 9 Dec 2023 14:25 UTC
2 points
We tried the following:
- On a harmful prompt, patch the refusal heads to their outputs on a harmless prompt.
- On a harmful prompt, ablate the refusal heads.
Neither experiment caused the model to bypass refusal; the model still refuses strongly.
This suggests that there are other pathways that trigger refusal. The set of heads we found appears to be sufficient to induce refusal, but not necessary (refusal can be induced even without them).
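
To make the "sufficient but not necessary" distinction concrete, here is a toy sketch in plain Python. It is not the experiment code: the head names, the numbers, the additive score, and the threshold are all hypothetical, and a real model is far from this linear. It only shows how patching or ablating a set of heads can leave refusal intact when another pathway (here `h_c`) also carries the signal.

```python
# Toy illustration only. Head names and values are hypothetical,
# not measurements from Llama-2-7b-chat.
REFUSAL_HEADS = {"h_a", "h_b"}  # stand-ins for the identified refusal heads
THRESHOLD = 2.0                 # hypothetical score above which the toy "refuses"

# Hypothetical per-head contributions to a scalar refusal score.
harmful = {"h_a": 2.0, "h_b": 1.5, "h_c": 3.0, "h_d": 0.5}
harmless = {"h_a": 0.1, "h_b": 0.0, "h_c": 0.2, "h_d": 0.5}


def refuses(head_outputs):
    """The toy model refuses when the summed contributions exceed THRESHOLD."""
    return sum(head_outputs.values()) > THRESHOLD


def patch_heads(target, source, heads):
    """Replace the outputs of `heads` in `target` with their values from `source`."""
    return {h: (source[h] if h in heads else v) for h, v in target.items()}


def ablate_heads(target, heads):
    """Zero out the outputs of `heads` in `target`."""
    return {h: (0.0 if h in heads else v) for h, v in target.items()}


patched = patch_heads(harmful, harmless, REFUSAL_HEADS)  # experiment 1
ablated = ablate_heads(harmful, REFUSAL_HEADS)           # experiment 2
induced = patch_heads(harmless, harmful, REFUSAL_HEADS)  # sufficiency check

print(refuses(harmful), refuses(patched), refuses(ablated))  # True True True
print(refuses(harmless), refuses(induced))                   # False True
```

In this toy, `h_c` alone keeps the score above the threshold, so neither intervention on the identified heads removes refusal, while patching those heads into the harmless run is enough to induce it.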

In the section “Suppressing refusal via steering”, we do show that we can extract the mean “refusal signal” from these heads and subtract it to bypass refusal.
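
As a rough sketch of what "extract the mean signal and subtract it" could look like, here is a minimal plain-Python version. The details are assumptions: it takes the signal to be a plain element-wise mean of the heads' outputs over harmful prompts, uses made-up 2-dimensional vectors, and introduces a hypothetical `scale` parameter. The post's section may compute and apply the signal differently.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steer(residual, signal, scale=1.0):
    """Subtract `scale * signal` from `residual`, element by element."""
    return [r - scale * s for r, s in zip(residual, signal)]


# Hypothetical outputs of the refusal heads on two harmful prompts.
refusal_head_outputs = [[2.0, 0.0], [4.0, 2.0]]
refusal_signal = mean_vector(refusal_head_outputs)

# Hypothetical residual-stream activation on a new harmful prompt.
residual = [5.0, 1.5]
steered = steer(residual, refusal_signal)

print(refusal_signal)  # [3.0, 1.0]
print(steered)         # [2.0, 0.5]
```

Unlike ablating the heads, this removes a direction from the activation itself, which is one way to read why steering can succeed where head ablation does not.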