Daniel Paleka comments on Reducing sycophancy and improving honesty via activation steering

Daniel Paleka 24 Aug 2023 9:39 UTC
1 point
Do the modified activations “stay in the residual stream” for the next token forward pass?
Is there any difference if they do or don’t?
If I understand the method correctly, the authors of “Steering GPT-2-XL by adding an activation vector” always added the steering vectors at the same (token, layer) coordinates, so in their setting this distinction doesn’t matter. However, if the vector is added at (last_token, layer), then there seems to be a difference.
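To make the distinction concrete, here is a minimal toy sketch (not taken from either post) of which positions would carry the steering vector at a given layer on the final forward pass, under two hypothetical regimes. The function name `steered_positions` and the `persist` flag are illustrative assumptions. `persist=True` stands for a setup where activations steered on earlier passes are reused (for example via a key/value cache), and `persist=False` for one where every pass recomputes all positions and steers only the current last token.

```python
import numpy as np

def steered_positions(prompt_len, n_generated, persist):
    """Boolean mask over positions: True where the steering vector is
    present at the steered layer on the final forward pass.

    Hypothetical regimes (assumptions for illustration only):
      persist=True  -> earlier steered activations are reused on later passes
      persist=False -> each pass recomputes everything, steering only the
                       current last position
    """
    total = prompt_len + n_generated
    mask = np.zeros(total, dtype=bool)
    if persist:
        # Every position that was ever "the last token" stays steered:
        # the final prompt token and each generated token after it.
        mask[prompt_len - 1:] = True
    else:
        # Only the current last position is steered.
        mask[-1] = True
    return mask

print(steered_positions(3, 2, persist=True))
print(steered_positions(3, 2, persist=False))
```

If the two masks differ, later layers attend over different activations, which is where a behavioral difference could come from.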
- Nina Panickssery 26 Aug 2023 2:25 UTC
  1 point
  I add the steering vector at every token position after the prompt, so in this respect it differs from the original approach in “Steering GPT-2-XL by adding an activation vector”. Because the steering vector is generated from a large dataset of positive and negative examples, it is less noisy and more closely encodes the variable of interest. There is therefore less reason to believe it would work especially well at one particular token position; it is better modeled as a way of more generally conditioning the probability distribution to favor one class of outputs over another.
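
A minimal sketch of the idea described above, on plain NumPy arrays rather than a real model. It assumes, for illustration only, that the steering vector is the difference between the mean activations of the positive and negative examples; the comment says only that it is generated from such a dataset. The names `mean_difference_vector`, `steer_after_prompt`, and the `scale` parameter are hypothetical.

```python
import numpy as np

def mean_difference_vector(pos_acts, neg_acts):
    """Assumed construction: mean activation over positive examples minus
    mean activation over negative examples. Inputs: (n_examples, d_model)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer_after_prompt(resid, vector, prompt_len, scale=1.0):
    """Add `scale * vector` to every position after the prompt.

    resid: (seq_len, d_model) activations at one layer.
    Positions [0, prompt_len) are left untouched.
    """
    steered = resid.copy()
    steered[prompt_len:] += scale * vector
    return steered

# Toy usage with random "activations".
rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, size=(100, 4))
neg = rng.normal(loc=-1.0, size=(100, 4))
vec = mean_difference_vector(pos, neg)

resid = np.zeros((5, 4))
out = steer_after_prompt(resid, vec, prompt_len=2)
```

Averaging over many examples is what makes the vector less tied to any one prompt or position, which fits applying it uniformly across all positions after the prompt.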