Second question is great. We’ve looked into this a bit, and (preliminarily) it seems like it’s the latter (base models learn some “harmful feature,” and this gets hooked into by the safety fine-tuned model). We’ll be doing more diligence on checking this for the paper.