I don’t have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.
Weight vectors are derived through fine-tuning. Insofar as you thought activation additions are importantly better than finetuning in some respects, and were already thinking about finetuning (e.g. via RLHF) when writing about why you were excited about activation additions, I don’t see how this paper changes the balance very much? (I wrote up my thoughts in “Activation additions have advantages over (RL/supervised) finetuning”.)
I think the main additional piece of information given by the paper is the composability of finetuned edits, which unlocks a space of finetuning configurations that grows exponentially with the number of composable edits. But I had already noted in the original version of the post that finetuning enjoys this benefit.
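To make the combinatorics concrete, here’s a minimal sketch of that composability in the style of “task arithmetic” (illustrative names and PyTorch-style state dicts, not the paper’s code): each task vector is a weight-space delta you can independently include or leave out, so n composable edits reach 2^n configurations.

```python
# Illustrative sketch of composing task vectors ("task arithmetic").
# `base_state` / `finetuned_state` are hypothetical state dicts of tensors, not the paper's code.


def task_vector(base_state, finetuned_state):
    """Weight-space delta produced by one finetuning run."""
    return {name: finetuned_state[name] - base_state[name] for name in base_state}


def apply_edits(base_state, task_vectors, include, alpha=1.0):
    """Add any chosen subset of task vectors back onto the base weights."""
    edited = {name: p.clone() for name, p in base_state.items()}
    for tv, on in zip(task_vectors, include):
        if on:
            for name in edited:
                edited[name] += alpha * tv[name]
    return edited


# Each of n edits can independently be on or off, so there are 2**n reachable configurations.
n_edits = 5
print(2 ** n_edits)  # 32
```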
There’s another strength which I hadn’t mentioned in my writing: if you can finetune in the opposite direction of the intended behavior (e.g. somehow make a model less honest) and then subtract that task vector, you can maybe increase honesty, even if you couldn’t naively finetune that honesty into the model.[1]
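A minimal sketch of that negation trick, reusing the hypothetical task_vector helper from the sketch above: finetune toward the unwanted behavior, take that weight delta, and subtract it from the base weights.

```python
def subtract_unwanted_edit(base_state, unwanted_finetuned_state, alpha=1.0):
    """Push the base model *away* from a behavior by negating its task vector.

    `unwanted_finetuned_state` is a hypothetical checkpoint finetuned toward the
    behavior you want less of (e.g. dishonesty); subtracting its task vector
    moves the base weights in the opposite direction.
    """
    tv = task_vector(base_state, unwanted_finetuned_state)
    return {name: base_state[name] - alpha * tv[name] for name in base_state}
```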
But, in a sense, task vectors are “still in the same modalities we’re used to.” Activation additions jolted me because they’re just… a new way[2] of interacting with models! There’s been way more thought and research put into finetuning and its consequences, relative to activation engineering and its alignment implications. I personally expect activation engineering to open up a lot of affordances for model-steering.
To be very clear about the novelty of our contributions, I’ll quote the “Summary of relationship to prior work” section:
We are not the first to steer language model behavior by adding activation vectors to residual streams. However, we are the first to do so without using machine optimization (e.g. SGD) to find the vectors. Among other benefits, our “activation addition” methodology enables much faster feedback loops than optimization-based activation vector approaches.
But this “activation engineering” modality is relatively new, and relatively unexplored, especially in its alignment implications. I found and cited two papers adding activation vectors to LMs to steer them, from 2022 and 2023.
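For contrast with these weight-space edits, here’s a rough, simplified sketch of an activation addition on GPT-2 via Hugging Face transformers. The steering vector is a difference of residual-stream activations on two contrasting prompts (no SGD anywhere), added back in at inference time; the layer index, coefficient, and prompts are illustrative assumptions, not settings from our paper.

```python
# Rough, simplified sketch of an activation addition on GPT-2 (illustrative settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
LAYER, COEFF = 6, 5.0  # assumed values, not tuned


def resid_at(prompt, layer):
    """Capture the residual-stream output of one transformer block for a prompt."""
    captured = {}

    def grab(_module, _inputs, output):
        captured["h"] = output[0].detach()

    handle = model.transformer.h[layer].register_forward_hook(grab)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return captured["h"]


# Steering vector: activation difference between two contrasting prompts -- no optimization.
love, hate = resid_at(" Love", LAYER), resid_at(" Hate", LAYER)
n = min(love.shape[1], hate.shape[1])
steer = love[:, :n] - hate[:, :n]


def add_steering(_module, _inputs, output):
    """Add the steering vector to the early residual-stream positions (simplified)."""
    hidden = output[0]
    k = min(hidden.shape[1], steer.shape[1])
    hidden[:, :k] += COEFF * steer[:, :k]
    return (hidden,) + output[1:]


handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
out = model.generate(**tok("I think dogs are", return_tensors="pt"), max_new_tokens=20)
print(tok.decode(out[0]))
handle.remove()
```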
[1] This is a kinda sloppy example because “honesty” probably isn’t a primitive property of the network’s reasoning. Sorry.