Hi Christopher, thanks for your work! I have high expectations about steering techniques in the context of AI Safety. I actually wrote a post about it, I would appreciate it if you have the time to challenge it!
Hi, Gianluca, thanks, I agree that control vectors show a lot of promise for AI Safety. I like your idea of using multiple control vectors simultaneously. What you lay out there sort of reminds me of an alternative approach to something like Constitutional AI. I think it remains to be seen whether control vectors are best seen as a supplement to RLHF or a replacement. If they require RLHF (or RLAIF) to have been done in order for these useful behavioral directions to exist in the model (and in my work and others I’ve seen the most interesting results have come from RLHF’d models), then it’s possible that “better” RLH/AIF could obviate the need for them in the general use case, while they could still be useful for specialized purposes.
Hi Christopher, thanks for your work! I have high expectations about steering techniques in the context of AI Safety. I actually wrote a post about it, I would appreciate it if you have the time to challenge it!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
I included a link to your post in mine, because they are strongly connected.
Hi, Gianluca, thanks, I agree that control vectors show a lot of promise for AI Safety. I like your idea of using multiple control vectors simultaneously. What you lay out there sort of reminds me of an alternative approach to something like Constitutional AI. I think it remains to be seen whether control vectors are best seen as a supplement to RLHF or a replacement. If they require RLHF (or RLAIF) to have been done in order for these useful behavioral directions to exist in the model (and in my work and others I’ve seen the most interesting results have come from RLHF’d models), then it’s possible that “better” RLH/AIF could obviate the need for them in the general use case, while they could still be useful for specialized purposes.