Second question is great. We’ve looked into this a bit, and (preliminarily) it seems like it’s the latter (base models learn some “harmful feature,” and this gets hooked into by the safety fine-tuned model). We’ll be doing more diligence on checking this for the paper.