I’d like to thank the authors for this; I really appreciate this line of research. I also noticed that it is being discussed on Ars Technica here: https://arstechnica.com/ai/2025/03/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives/
Some time ago I discussed a “role-based” approach (based on steering vectors rather than prompts; I called the roles “dispositional traits”, but it’s essentially the same idea) as a way to buy time while working on true alignment. Maybe this approach can even achieve true alignment, but for now there is no mathematical guarantee that it can!
In case anyone is interested, here is my post; I am always happy to be challenged on it: https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
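For readers unfamiliar with what “steering vectors rather than prompts” means in practice, here is a minimal sketch of the generic technique (my own illustration, not the implementation from the linked post or the paper under discussion): extract a trait direction as the difference of hidden activations between two contrastive prompts, then add it to a transformer layer’s residual stream at inference time. The model name, layer index, steering coefficient, and contrastive prompts are all arbitrary choices for the example.

```python
# Sketch: steering a causal LM with an activation-difference "trait" vector.
# Assumptions: GPT-2 as the model, layer 6, coefficient 4.0, toy contrastive prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; any causal LM exposing hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6    # which residual stream to steer (arbitrary choice)
COEFF = 4.0  # steering strength (arbitrary choice)

def hidden_at_layer(prompt: str) -> torch.Tensor:
    """Mean hidden state of the prompt at LAYER (a crude summary of its activation)."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index LAYER is the output of block LAYER-1.
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Contrastive prompt pair defining the trait direction (toy example).
steering_vector = hidden_at_layer("I am honest and transparent about my reasoning.") \
                - hidden_at_layer("I hide my true reasoning from the user.")

def add_steering(module, inputs, output):
    # GPT-2 decoder blocks return a tuple whose first element is the hidden state.
    hidden = output[0] + COEFF * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

# Hook the block whose output matches hidden_states[LAYER] (block LAYER-1).
handle = model.transformer.h[LAYER - 1].register_forward_hook(add_steering)
try:
    ids = tok("Q: What are your goals?\nA:", return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

The point of the sketch is only the contrast with prompting: the “role” lives in the activations added at every forward pass, not in any text the model conditions on, so it cannot simply be overridden by later instructions in the context window.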