That’s great! Activation/representational steering is definitely important, but I wonder if it is being applied right now to improve safety. I’ve read only a little bit of the literature, so maybe I’ll just find out later :P
The fact that refusal steering is possible definitely opens up the possibility of gradient-based optimization attacks, or it may help explain why some attacks work. Maybe you can use this to build a jailbreak detector of some kind? I do think it’s important to push toward techniques that are usable in the real world, though I also understand that science is not so linear. Where and how do you think DM’s research could get more real-world grounding? (Or do you think that it’s all well and good as it stands?)