This is cool! How cherry-picked are your three prompts? I’m curious whether it’s usually the case that the top refusal-gradient-aligned SAE features are so interpretable.
These three prompts are very cherry-picked. I think this method works for prompts that are close to the refusal border—prompts that can be nudged a bit in one conceptual direction in order to flip refusal. (And even then, I think it is pretty sensitive to phrasing.) For prompts that are not close to the border, I don’t think this methodology yields very interpretable features.
For this post, we didn’t systematically characterize the methodology across a wide range of prompts. That seems like a good thing to investigate properly. I expect there to be a nice way of characterizing a “borderline” prompt (e.g. a large-magnitude refusal gradient, perhaps).
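To make that concrete, here is a rough sketch of what such a check could look like. It is illustrative only and is not the code used for the post. It assumes you already have the gradient of some refusal metric with respect to the residual-stream activation (`grad`) and an SAE decoder matrix with one row per feature (`decoder`). Using cosine similarity as the alignment measure and the gradient norm as the "borderline" score are both assumptions, and the function names are hypothetical.

```python
import numpy as np

def top_aligned_features(grad, decoder, k=5):
    """Rank SAE features by cosine similarity between each decoder
    direction and the refusal gradient (an assumed alignment measure)."""
    g = grad / np.linalg.norm(grad)
    d = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    cos = d @ g                      # one cosine similarity per feature
    top = np.argsort(-cos)[:k]       # indices of the k most aligned features
    return top, cos[top]

def borderline_score(grad):
    """Hypothetical 'borderline' score: the L2 norm of the refusal gradient."""
    return float(np.linalg.norm(grad))

# Toy data standing in for real model gradients and SAE weights.
rng = np.random.default_rng(0)
decoder = rng.normal(size=(64, 16))                   # 64 features, d_model = 16
grad = 3.0 * decoder[7] + 0.1 * rng.normal(size=16)   # mostly along feature 7

idx, sims = top_aligned_features(grad, decoder, k=3)
score = borderline_score(grad)
```

On real data, one could compute `borderline_score` over a large set of prompts and check whether the high-scoring ones are the prompts where the top-aligned features look interpretable.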
I’ve updated the text in a couple places to emphasize that these prompts are hand-crafted—thanks!