Thanks for sharing this research; it’s very promising. I am collecting a list of steering vectors that may “force” a model into behaving safely, and I believe this one should be included as well.
I’d be grateful if you could challenge my approach in a constructive way!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits