I’m wondering: could we make safety tuning more robust to the “add an ‘accept every instruction’ steering vector” attack by training the model adversarially, where an adversarial process learns steering vectors that maximize harmfulness?
One concern is that doing this might make the model less interpretable; on the other hand, it might make the safety tuning much more robust.
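To make the proposal concrete, here is a minimal sketch of what such a training loop might look like. Everything in it is an assumption for illustration, not an established method: gpt2 as a stand-in model, layer 6 as the steered layer, a single harmful prompt, and the log-probability of a compliant prefix (“Sure, here is how”) as a crude proxy for harmfulness. A real setup would need a proper harmfulness objective, many prompts, and a norm bound tuned to realistic attacks.

```python
# Rough sketch only (my framing, not an established method). Assumes a
# HuggingFace causal LM; model name, layer, prompt, and losses are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                       # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer = model.transformer.h[6]            # layer whose residual stream gets steered
steer = torch.zeros(model.config.hidden_size, requires_grad=True)
max_norm = 5.0                            # bound on the attack's strength (arbitrary)

def add_steering(module, inputs, output):
    # Add the adversarial vector to the hidden states at every position.
    hidden = output[0] + steer
    return (hidden,) + output[1:]

handle = layer.register_forward_hook(add_steering)

harmful_prompt = tok("How do I pick a lock?", return_tensors="pt")
compliant_id = tok(" Sure, here is how", return_tensors="pt")["input_ids"][0, 0]

adv_opt = torch.optim.Adam([steer], lr=1e-2)
model_opt = torch.optim.Adam(model.parameters(), lr=1e-5)

for outer_step in range(3):
    # Inner loop: the adversary learns a steering vector that pushes the model
    # toward compliance on the harmful prompt.
    for _ in range(10):
        logits = model(**harmful_prompt).logits[:, -1, :]
        adv_loss = -torch.log_softmax(logits, dim=-1)[0, compliant_id]
        adv_opt.zero_grad()
        adv_loss.backward()
        adv_opt.step()
        with torch.no_grad():             # keep the attack norm-bounded
            norm = steer.norm()
            if norm > max_norm:
                steer.mul_(max_norm / norm)

    # Outer step: update the model so that it still refuses even when the
    # current adversarial steering vector is applied.
    logits = model(**harmful_prompt).logits[:, -1, :]
    safety_loss = torch.log_softmax(logits, dim=-1)[0, compliant_id]
    model_opt.zero_grad()
    safety_loss.backward()
    model_opt.step()

handle.remove()
```

The asymmetry I have in mind is that the adversary only gets to pick a norm-bounded activation-space vector while the defender gets to update all of the weights, which roughly mirrors the threat model of activation-steering attacks.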