We find that these directions are similar across refusal heads, and so we take the mean across them to get a single “refusal direction”.
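The aggregation described there can be sketched in a few lines (a minimal illustration with hypothetical names; the paper's actual extraction pipeline is more involved):

```python
import numpy as np

def mean_refusal_direction(head_directions: np.ndarray) -> np.ndarray:
    """Average per-head refusal directions into one unit "refusal direction".

    head_directions: (n_heads, d_model) array, one candidate direction per head.
    """
    # Normalize each head's direction first, so heads with larger-norm
    # vectors don't dominate the mean.
    rows = head_directions / np.linalg.norm(head_directions, axis=1, keepdims=True)
    mean = rows.mean(axis=0)
    # Renormalize the mean back to a unit vector.
    return mean / np.linalg.norm(mean)
```

Averaging only makes sense if the per-head directions already point roughly the same way, which is what the quoted claim asserts.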
My suspicion is that they’re semantically similar but not identical concepts: either 6 different subcategories of harmfulness, or different concepts that, when suitably overlapped or combined, make for a good classifier of it. Combining 6 blurred hyperspheres of different radii in the semantic embedding space gives you a blobby shape (in a space where certain blurred subspace regions are likely much more densely used than others). Otherwise, why would the model devote 6 heads to this when just 1 would do? (Or was it trained with dropout, encouraging redundancy purely for reliability?) I’d also expect there to be some more heads implementing “UNLESS Y, OR Z” exceptions that your approach so far might not have detected.
It would be an interesting follow-on to see whether you can identify what each of these 6 heads is doing, and whether there are also refusal-inhibitory heads (presumably there are, since some jailbreaks work). It would also be interesting to explore other refusal reasons: OpenAI’s public content classifier API basically provides a short list of categories (which may not be complete: it seems to omit criminality).
Great work, and this opens up a lot of room for follow-on research. Also very valuable for anyone instruct-training LLMs (or even just blending preexisting ones).