Nina Panickssery comments on Finding Features Causally Upstream of Refusal

Nina Panickssery 14 Jan 2025 21:01 UTC
4 points
This is cool! How cherry-picked are your three prompts? I’m curious whether it’s usually the case that the top refusal-gradient-aligned SAE features are so interpretable.
- Andy Arditi 15 Jan 2025 3:14 UTC
  2 points
  These three prompts are very cherry-picked. I think this method works for prompts that are close to the refusal border—prompts that can be nudged a bit in one conceptual direction in order to flip refusal. (And even then, I think it is pretty sensitive to phrasing.) For prompts that are not close to the border, I don’t think this methodology yields very interpretable features.
  For this post, we didn’t do the diligence of characterizing the methodology across a wide range of prompts. That seems like a good thing to investigate properly. I expect there to be a nice way of characterizing a “borderline” prompt (e.g. a large-magnitude refusal gradient, perhaps).
  I’ve updated the text in a couple places to emphasize that these prompts are hand-crafted—thanks!
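
To make the "large-magnitude refusal gradient" idea above concrete, here is a minimal toy sketch. It is not the post's method: it assumes, purely for illustration, that refusal probability is a logistic function of the activation's projection onto a single hypothetical refusal direction `r`. Under that assumption the gradient norm is `p * (1 - p) * ||r||`, which peaks at `p = 0.5`, so prompts near the refusal border get the largest score. All names and values (`refusal_prob`, `refusal_grad_norm`, the example activations) are made up.

```python
import numpy as np

def refusal_prob(x, r, b):
    """Toy refusal probability: logistic function of a linear readout (assumption)."""
    return 1.0 / (1.0 + np.exp(-(x @ r + b)))

def refusal_grad_norm(x, r, b):
    """Norm of d(refusal_prob)/dx; for this toy model it equals p * (1 - p) * ||r||."""
    p = refusal_prob(x, r, b)
    return p * (1.0 - p) * np.linalg.norm(r)

# Hypothetical unit "refusal direction" and bias in a 4-dimensional activation space.
r = np.array([1.0, 0.0, 0.0, 0.0])
b = 0.0

# Made-up activations standing in for three prompts at different distances from the border.
prompts = {
    "clearly_harmless": -6.0 * r,
    "borderline": 0.2 * r,
    "clearly_harmful": 6.0 * r,
}

# Score each prompt by gradient magnitude; the largest score marks the most "borderline" prompt.
scores = {name: refusal_grad_norm(x, r, b) for name, x in prompts.items()}
most_borderline = max(scores, key=scores.get)
print(scores, most_borderline)
```

In a real model the gradient would come from backpropagating a refusal metric through the network rather than from a closed form, and whether its magnitude tracks interpretability of the top features is exactly the open question raised above.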