A year on, it has been confirmed that steering vectors (also called control vectors) work remarkably well, so I decided to explain the technique again and show how it could be used to steer dispositional traits into a model. I believe the technique can buy us time while we work on true safety techniques.
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
If you have the time to read and challenge my analysis, I’d be very grateful!