TurnTrout comments on Steering GPT-2-XL by adding an activation vector

TurnTrout 15 May 2023 20:39 UTC
The argument against weights was of the form “here’s a strength activations has”; for it to be enough to dismiss the paper without discussion
I personally don’t “dismiss” the task vector work. I didn’t read Thomas as dismissing it by not calling it the concrete work he is most excited about—that seems like a slightly uncharitable read?
I, personally, think the task vector work is exciting. Back in "Understanding and controlling a maze-solving policy network", I wrote (emphasis added):
Editing Models with Task Arithmetic explored a “dual” version of our algebraic technique. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-diff vectors. While our technique modifies activations, the techniques seem complementary, and both useful for alignment.
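To make the "dual" framing in that quote concrete, here is a toy NumPy sketch of the two kinds of edit. It is an illustration under stated assumptions, not the method of either paper: the two-layer network, the shapes, the random stand-in for finetuned weights, the contrast inputs, and the coefficient 5.0 are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network. Every name and shape here is hypothetical.
W1_pre = rng.normal(size=(4, 3))                 # "pretrained" first-layer weights
W1_ft = W1_pre + 0.1 * rng.normal(size=(4, 3))   # random stand-in for finetuned weights
W2 = rng.normal(size=(2, 4))                     # second layer, left untouched


def hidden(W1, x):
    """First-layer activations (ReLU)."""
    return np.maximum(W1 @ x, 0.0)


def forward(W1, x, steering=None):
    """Forward pass; optionally add a vector to the hidden activations."""
    h = hidden(W1, x)
    if steering is not None:
        h = h + steering
    return W2 @ h


# Weight-space edit: the task vector is the finetuned-minus-pretrained weight diff.
task_vector = W1_ft - W1_pre
W1_added = W1_pre + task_vector      # adding it back recovers the finetuned weights
W1_negated = W1_pre - task_vector    # subtracting it moves the weights the other way

# Activation-space edit: a scaled difference of hidden activations on two
# contrast inputs, added during the forward pass. The weights stay fixed.
x_a, x_b = rng.normal(size=3), rng.normal(size=3)
steering_vector = 5.0 * (hidden(W1_pre, x_a) - hidden(W1_pre, x_b))

x = rng.normal(size=3)
out_plain = forward(W1_pre, x)
out_weight_edit = forward(W1_added, x)
out_steered = forward(W1_pre, x, steering=steering_vector)
```

In this sketch the weight edit changes the model for every later input, while the activation edit is applied per forward pass and can be switched off by passing no steering vector.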
I’m highly uncertain about the promise of activation additions. I think their promise ranges from the pessimistic “superficial stylistic edits” to the optimistic “easy activation/deactivation of the model’s priorities at inference time.” In the optimistic worlds, activation additions do enjoy extreme advantages over task vectors, like access to internal model properties that finetuning can’t reach (see the speculation portion of the post). In the very pessimistic worlds, activation additions are probably less directly important than task vectors.
I don’t know what world we’re in yet.