I agree that there’s something nice about activation steering not optimizing the network relative to some other black-box feedback metric. (I, personally, feel less concerned by e.g. finetuning against some kind of feedback source; the bullet feels less jawbreaking to me, but maybe this isn’t a crux.)
(Medium confidence) FWIW, RLHF’d models (specifically, the LLAMA-2-chat series) seem substantially easier to activation-steer than do their base counterparts.
I agree that there’s something nice about activation steering not optimizing the network relative to some other black-box feedback metric. (I, personally, feel less concerned by e.g. finetuning against some kind of feedback source; the bullet feels less jawbreaking to me, but maybe this isn’t a crux.)
(Medium confidence) FWIW, RLHF’d models (specifically, the LLAMA-2-chat series) seem substantially easier to activation-steer than do their base counterparts.