The top-performing vector is odd in another way. Because the tokens of the positive and negative prompts are subtracted from each other, a reasonable intuition is that the subtraction should point in a meaningful direction. However, some steering vectors that perform well in our test don’t have that property. For the steering vector “Wedding Planning Adventures” − “Adventures in self-discovery”, the positive and negative sides aren’t well aligned at the token level at all:
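To make that misalignment concrete, here is a minimal sketch that just tokenizes the two prompts and lines their positions up; it assumes the GPT-2 BPE tokenizer from Hugging Face, which may not match the exact setup used for the steering experiments:

```python
# Sketch: tokenize the two steering prompts and show which tokens end up at
# the same position (GPT-2 BPE tokenizer assumed; illustration only).
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
pos_tokens = tok.tokenize("Wedding Planning Adventures")
neg_tokens = tok.tokenize("Adventures in self-discovery")

# Print the token pairs whose activations would be subtracted position by position.
for i in range(max(len(pos_tokens), len(neg_tokens))):
    p = pos_tokens[i] if i < len(pos_tokens) else "<pad>"
    n = neg_tokens[i] if i < len(neg_tokens) else "<pad>"
    print(f"position {i}: {p!r} - {n!r}")
```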
I think I don’t see the mystery here.

When you directly subtract the steering prompts from each other, most of the results would not make sense, yes. But that is not what we do.

We feed these prompts into the transformer and then subtract the residual stream activations after block n from each other. Within those n layers, the attention heads have moved information around between the positions. Here is one way this could have happened:
Suppose the first 4 blocks assess the sentiment of the whole sentence and move this information to position 6 of the residual stream, the other positions being irrelevant. So when we construct the steering vector and record the activations after block 4, the first 5 positions of the steering vector are irrelevant and the 6th position contains a vector that points in a general “wedding-ness” direction. When we add this steering vector to the activations of our normal prompt, the transformer acts as if the preceding context really was wedding-related and ‘keeps talking’ about weddings.
Obviously, all the details here are made up, but I don’t see why a token-for-token meaningful alignment of the steering vector’s prompts should intuitively be helpful for something like this to work.
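For concreteness, here is a minimal sketch of the procedure as I understand it: record the residual stream after block n for both prompts, subtract, and add the difference back into a forward pass on a new prompt. The layer index, coefficient, test prompt, and zero-padding of the shorter prompt are my own illustrative choices, not the exact settings from the post:

```python
# Sketch of activation addition: subtract the residual stream activations of two
# prompts after block LAYER, then inject the scaled difference into a new forward pass.
# LAYER, COEFF, the test prompt, and the zero-padding are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 4    # record / inject the residual stream after this block (1-indexed)
COEFF = 5.0  # scaling applied to the steering vector before injection

def residual_after_block(prompt: str) -> torch.Tensor:
    """Residual stream activations after block LAYER, shape (seq_len, d_model)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output; hidden_states[i] is the output of block i.
    return out.hidden_states[LAYER][0]

pos = residual_after_block("Wedding Planning Adventures")
neg = residual_after_block("Adventures in self-discovery")

# Pad the shorter activation sequence with zeros so the per-position subtraction
# is defined, then take the difference: this is the steering vector.
n_pos = max(pos.shape[0], neg.shape[0])
steer = torch.zeros(n_pos, pos.shape[1])
steer[: pos.shape[0]] += pos
steer[: neg.shape[0]] -= neg

def inject_steering(module, inputs, output):
    hidden = output[0]
    # Only touch the forward pass over the prompt; later cached generation steps
    # process one token at a time and are left alone.
    if hidden.shape[1] >= n_pos:
        hidden[:, :n_pos, :] += COEFF * steer.to(hidden.dtype)
    return (hidden,) + output[1:]

# transformer.h is 0-indexed, so h[LAYER - 1] is the block whose output we recorded.
handle = model.transformer.h[LAYER - 1].register_forward_hook(inject_steering)
try:
    prompt = tok("I went up to my friend and said", return_tensors="pt").input_ids
    steered = model.generate(prompt, max_new_tokens=40, do_sample=True, top_p=0.95)
    print(tok.decode(steered[0]))
finally:
    handle.remove()
```

Nothing in this sketch requires the two prompts to line up token for token; the subtraction is defined position by position regardless of what the attention heads have written into those positions.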