TurnTrout comments on johnswentworth’s Shortform

TurnTrout 4 Sep 2023 17:43 UTC
LW: 6 AF: 5
I agree that there’s something nice about activation steering not optimizing the network relative to some other black-box feedback metric. (I, personally, feel less concerned by e.g. finetuning against some kind of feedback source; the bullet feels less jawbreaking to me, but maybe this isn’t a crux.)
(Medium confidence) FWIW, RLHF’d models (specifically, the Llama-2-chat series) seem substantially easier to activation-steer than their base counterparts.