I’m slightly confused as to why red-teaming via activation additions should be preferred over e.g. RAT; it seems possible that RAT models out-of-test-distribution-but-still-in-deployment-distribution activations better, or more robustly, than directly adding a steering vector does. Cool work though!
I don’t think red-teaming via activation steering should necessarily be preferred over generating adversarial examples. However, it could be more efficient (requiring less compute) and could get by with a less precise specification of the behavior you’re trying to adversarially elicit.
Furthermore, activation steering could help us better understand the mechanism behind the unwanted behavior, through measurable signals such as which local perturbations are effective and which datasets yield steering vectors that elicit the unwanted behavior.
Finally, activation steering might elicit a wider range of behaviors and hidden functionality than existing methods of finding adversarial examples, though I am much less certain about this.
Overall, it’s just another tool to consider adding to our evaluation / red-teaming toolbox.
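For concreteness, here is a rough toy sketch of the kind of perturbation I have in mind. It assumes the steering vector is the difference of mean activations between two contrasting datasets, added to a hidden state with some scale. The function names and the numbers are made up for illustration; nothing here touches a real model or a real library API.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def steering_vector(positive_acts, negative_acts):
    """Assumed construction: difference of mean activations between
    two contrasting datasets (hypothetical helper, not a library call)."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos, neg)]

def add_steering(hidden, vector, scale=1.0):
    """Add a scaled steering vector to a single hidden-state vector."""
    return [h + scale * v for h, v in zip(hidden, vector)]

# Toy activations, invented for illustration (not real model outputs).
positive_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
negative_acts = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

vec = steering_vector(positive_acts, negative_acts)
steered = add_steering([0.5, 0.5, 0.5], vec, scale=2.0)
```

In a real setup the vectors would come from a chosen layer of the model, and sweeping `scale`, the layer, and the pair of datasets is what would tell you which perturbations actually elicit the behavior.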