Thank you, Paul. This post clarifies many open points about AI (inner) alignment, including some of its limits!
I recently described a technique called control vectors that forces an LLM to exhibit specific dispositional traits, inducing a conditioned form of alignment (though definitely not true alignment).
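To make the idea concrete, here is a minimal sketch of the general activation-steering recipe behind control vectors: contrast mean hidden states from trait-positive and trait-negative prompts, then add the scaled difference vector back into a layer at inference time. All names (`get_hidden`, `build_control_vector`, `steer`, `layer_idx`, `scale`) and the Llama-style `model.model.layers` path are my own illustrative assumptions, not the exact code from the linked post:

```python
import torch

def get_hidden(model, tokenizer, prompt, layer_idx):
    """Mean hidden state of `prompt` at the chosen layer (assumes a HF causal LM)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer_idx].mean(dim=1).squeeze(0)

def build_control_vector(model, tokenizer, pos_prompts, neg_prompts, layer_idx):
    """Difference of mean activations between trait-positive and trait-negative prompts."""
    pos = torch.stack([get_hidden(model, tokenizer, p, layer_idx) for p in pos_prompts])
    neg = torch.stack([get_hidden(model, tokenizer, p, layer_idx) for p in neg_prompts])
    return pos.mean(dim=0) - neg.mean(dim=0)

def steer(model, layer_idx, control_vector, scale=1.0):
    """Forward hook that adds the scaled control vector to the layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * control_vector.to(device=hidden.device, dtype=hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    # Assumes a Llama-style layer list; adjust the attribute path for other architectures.
    return model.model.layers[layer_idx].register_forward_hook(hook)
```

In use, you would build the vector once, call `handle = steer(model, layer_idx, cv, scale=...)` before generation, and `handle.remove()` afterwards; the sign and magnitude of `scale` control how strongly the trait is expressed or suppressed.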
I’d be happy to be challenged! In my opinion, the importance of control vectors for AI safety is definitely underestimated. https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits