Have you tried discussing the concepts of harm or danger with a model that can’t represent the refusal direction?
I would also be curious how much the refusal direction differs when computed from a base model versus an HHH model: is refusal a new concept, or do base models mostly learn a ~harmful direction that turns into a refusal direction during fine-tuning?
Cool work overall!
Second question is great. We’ve looked into this a bit, and (preliminarily) it seems like it’s the latter (base models learn some “harmful feature,” and this gets hooked into by the safety fine-tuned model). We’ll be doing more diligence on checking this for the paper.
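For concreteness, here is a minimal sketch of the kind of comparison the second question is asking about, assuming the harmful/harmless prompt activations have already been extracted from both the base and the safety fine-tuned model (the placeholder arrays and the difference-in-means construction are illustrative, not the paper's exact pipeline):

```python
import numpy as np

def diff_in_means_direction(harmful_acts, harmless_acts):
    """Candidate 'harmful/refusal' direction: mean(harmful) - mean(harmless), normalized.

    Inputs are (n_prompts, d_model) residual-stream activations taken at some
    fixed layer and token position (assumed to be precomputed)."""
    direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

# Placeholder activations; in practice these would come from running the base
# model and the HHH (safety fine-tuned) model on the same harmful/harmless prompt sets.
rng = np.random.default_rng(0)
d_model = 4096
base_harmful, base_harmless = rng.normal(size=(128, d_model)), rng.normal(size=(128, d_model))
chat_harmful, chat_harmless = rng.normal(size=(128, d_model)), rng.normal(size=(128, d_model))

base_dir = diff_in_means_direction(base_harmful, base_harmless)
chat_dir = diff_in_means_direction(chat_harmful, chat_harmless)

# A cosine similarity near 1 would suggest the fine-tuned model reuses the base
# model's "harmful" feature rather than learning a brand-new refusal concept.
print(f"cosine(base_dir, chat_dir) = {float(base_dir @ chat_dir):.3f}")
```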