> Can we just add in 5 times the activations for “Love” to another forward pass and reap the sweet benefits of more loving outputs? Not quite. We found that it works better to pair two activation additions.
Do you have evidence for this?
It’s totally unsurprising to me that you need to do this on HuggingFace models, as the residual stream very likely has a constant bias term which you will not want to add to. I saw you used TransformerLens for some part of the project, and TL removes the mean from all additions to the residual stream, which I would have guessed would solve the problem here. EDIT: see reply.
I even tested this:
Empirically, in TransformerLens the 5*Love and 5*(Love-Hate) additions were basically identical in a blind trial on myself: I found 5*Love more loving 15 times versus 5*(Love-Hate) more loving 12 times, and, rating coherence independently, each addition was more coherent 13 times. In several trials the two seemed identical to me on loving-ness or coherence.
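For reference, here is a minimal TransformerLens sketch of the two variants being compared. This is not the exact script behind the trial above: the layer, coefficient, and injection site are placeholders, and it assumes the two steering prompts tokenize to the same number of tokens (the post pads prompts with spaces otherwise).

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-xl")

LAYER = 6                                   # placeholder injection layer
HOOK = f"blocks.{LAYER}.hook_resid_pre"     # residual stream entering this block
COEFF = 5.0

# Cache residual-stream activations for the steering prompts
# (assumed here to tokenize to the same length).
_, love_cache = model.run_with_cache("Love")
_, hate_cache = model.run_with_cache("Hate")
love_act = love_cache[HOOK]                 # shape [1, prompt_len, d_model]
hate_act = hate_cache[HOOK]

single_addition = COEFF * love_act               # 5 * Love
paired_addition = COEFF * (love_act - hate_act)  # 5 * (Love - Hate)

def steering_hook(steer):
    n = steer.shape[1]
    def hook_fn(resid_pre, hook):
        # Only add on passes that contain the full prompt (generation with a
        # KV cache runs later forward passes on a single new token).
        if resid_pre.shape[1] >= n:
            resid_pre[:, :n, :] = resid_pre[:, :n, :] + steer.to(resid_pre.device)
        return resid_pre
    return hook_fn

prompt = "I hate you because"
for steer in (single_addition, paired_addition):
    with model.hooks(fwd_hooks=[(HOOK, steering_hook(steer))]):
        print(model.generate(prompt, max_new_tokens=40, do_sample=True))
```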
We used TL to cache activations for all experiments, but are considering moving away from it to improve memory efficiency.
> TL removes the mean from all additions to the residual stream, which I would have guessed would solve the problem here.
Oh, somehow I’m not familiar with this. Is this `center_unembed`? Or are you talking about something else?
> Do you have evidence for this?
Yes, but I think the evidence didn’t actually come from the “Love”—“Hate” prompt pair. Early in testing we found paired activation additions worked better. I don’t have a citeable experiment off-the-cuff, though.
No, this isn’t about `center_unembed`; it’s about `center_writing_weights`, as explained here: https://github.com/neelnanda-io/TransformerLens/blob/main/further_comments.md#centering-writing-weights-center_writing_weight

This is turned on by default in TL, so I think there must be something else weird about the models, rather than just a naive bias, that causes you to need to do the difference thing.
I still don’t follow. Apparently, TL’s `center_writing_weights` adapts the writing weights in a pre-LN-invariant fashion (and also in a way which doesn’t affect the softmax probabilities after the unembed). This means the actual computations of the forward pass are left unaffected by this weight modification, up to precision limitations, right? So our results in particular should not be affected by TL vs. HF.
Oops, I was wrong in my initial hunch; I assumed centering the writing weights did something extra. I’ve edited my top-level comment, thanks for pointing out my oversight!
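To spell out why the forward pass is unchanged, here is a toy check (not TL’s actual code): centering a writing matrix only removes a component along the all-ones direction of the residual stream, and LayerNorm subtracts the per-position mean before anything downstream reads from the stream, so that component never mattered; a uniform shift of the logits likewise leaves the softmax unchanged.

```python
import torch

torch.manual_seed(0)
d_model, d_in = 32, 8

# Toy "writing" matrix (think W_out of some component) and its centered version:
# each vector written to the residual stream has its mean across d_model
# subtracted, roughly what center_writing_weights does.
W = torch.randn(d_in, d_model)
W_centered = W - W.mean(dim=-1, keepdim=True)

resid = torch.randn(d_model)     # residual stream before the write
inp = torch.randn(d_in)          # what the component writes with

out_plain = resid + inp @ W
out_centered = resid + inp @ W_centered   # differs only by a constant * ones vector

# LayerNorm removes the per-position mean, so both versions normalize identically.
ln = torch.nn.LayerNorm(d_model, elementwise_affine=False)
print(torch.allclose(ln(out_plain), ln(out_centered), atol=1e-6))  # True
```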