My take is that activation-steering methods work better the more the training distribution anticipates the behavior we want to incentivize, and the better humans understand what behavior they're aiming for.
So if used as a main alignment technique, they only work in a sort of easy-mode world: one where getting a par-human AI to behave kinda-well on the domain used to create it is sufficient for the human-AI team to do better at creating the next one, and so on until you reach a stably good outcome. That's a lot like the profile of RLHF, except trading human feedback for reliance on AI generalization.
I think the biggest complement to activation steering is research on how to improve (from a human perspective) the generalization of AI internal representations. And a good selling point for activation steering research is that the reverse is also true: if you can do okay steering by applying a simple function to some intermediate layer, that probably helps with research on all the things that might make that steering work even better.
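To make "applying a simple function to some intermediate layer" concrete, here's a minimal sketch of the standard activation-addition recipe in PyTorch with Hugging Face transformers. The model, layer index, contrast prompts, and steering strength below are all arbitrary illustrative choices, not anything specified above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is just a small stand-in; any decoder-only transformer with
# accessible blocks works the same way.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

LAYER = 6  # which intermediate layer to steer (an arbitrary choice)

def get_residual(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER for a prompt."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Steering vector = difference of activations on a contrastive prompt
# pair; the "simple function" is then just adding this vector.
steering_vec = get_residual("I love this") - get_residual("I hate this")

def steering_hook(module, inputs, output):
    # The block's output tuple has hidden states first; add the
    # steering vector (4.0 is an arbitrary strength) and pass the
    # rest of the tuple through unchanged.
    hidden = output[0] + 4.0 * steering_vec
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    ids = tok("The movie was", return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=20)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook when done
```

The same hook-based pattern carries over to richer interventions, which is part of the selling point: once you can read and write an intermediate layer this easily, the rest is research on what to write.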
Overall, though, I’m not that enthusiastic about it as a rich research direction.