Have you tried discussing the concepts of harm or danger with a model that can’t represent the refusal direction?
I would also be curious how much the refusal direction differs when computed from a base model versus an HHH model: is refusal a new concept, or do base models mostly learn a ~harmful direction that turns into a refusal direction during fine-tuning?
Cool work overall!
Second question is great. We’ve looked into this a bit, and (preliminarily) it seems like it’s the latter (base models learn some “harmful feature,” and this gets hooked into by the safety fine-tuned model). We’ll be doing more diligence on checking this for the paper.
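For concreteness, here is a minimal sketch of the kind of comparison the second question is asking about, assuming the harmful/harmless prompt activations have already been extracted from both the base and the safety fine-tuned model (the placeholder arrays and the difference-in-means construction are illustrative, not the paper's exact pipeline):

```python
import numpy as np

def diff_in_means_direction(harmful_acts, harmless_acts):
    """Candidate 'harmful/refusal' direction: mean(harmful) - mean(harmless), normalized.

    Inputs are (n_prompts, d_model) residual-stream activations taken at some
    fixed layer and token position (assumed to be precomputed)."""
    direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

# Placeholder activations; in practice these would come from running the base
# model and the HHH (safety fine-tuned) model on the same harmful/harmless prompt sets.
rng = np.random.default_rng(0)
d_model = 4096
base_harmful, base_harmless = rng.normal(size=(128, d_model)), rng.normal(size=(128, d_model))
chat_harmful, chat_harmless = rng.normal(size=(128, d_model)), rng.normal(size=(128, d_model))

base_dir = diff_in_means_direction(base_harmful, base_harmless)
chat_dir = diff_in_means_direction(chat_harmful, chat_harmless)

# A cosine similarity near 1 would suggest the fine-tuned model reuses the base
# model's "harmful" feature rather than learning a brand-new refusal concept.
print(f"cosine(base_dir, chat_dir) = {float(base_dir @ chat_dir):.3f}")
```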