Really exciting! I added a version of AVEC to my interpretability tool for gridworld agents and am keen to explore it more. I really like that the injection coefficient is a scalar; this has enabled me to do what I call an "injection coefficient scan".
The procedure I'm using looks like this (a code sketch follows the list):
1. Repeat your input tokens, say, 128 times.
2. During the AVEC forward pass, apply the activation vector to each copy, sweeping the injection coefficient across 128 evenly spaced values between −10 and 10.
3. Decompose the resulting residual stream to whatever granularity you like (use decompose_resid or get_full_resid_decomposition, with or without expand_neurons).
4. Dot product the outputs with your logit direction of choice (I use a logit diff that is meaningful in my task).
5. Plot the resulting attribution against the injection coefficient for each component.
6. If you like, cluster the profiles to show which components contribute similar functions of the injection coefficient to your decision.
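For concreteness, here is a minimal TransformerLens sketch of the scan above, assuming a small LM in place of my gridworld setup. The model name, injection layer, prompt, answer tokens, and steering vector are all placeholders, not my actual configuration:

```python
import torch
import einops
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # placeholder model
layer = 6                                          # placeholder injection layer
steering_vec = torch.randn(model.cfg.d_model)      # placeholder AVEC vector

n_steps = 128
coeffs = torch.linspace(-10, 10, n_steps)

# One copy of the prompt per coefficient value.
tokens = model.to_tokens("The Eiffel Tower is located in")  # placeholder prompt
batch = tokens.repeat(n_steps, 1)

def inject(resid_pre, hook):
    # Add coeff_i * steering_vec at every position of batch element i.
    shift = coeffs[:, None, None] * steering_vec
    return resid_pre + shift.to(resid_pre.device)

with model.hooks(fwd_hooks=[(f"blocks.{layer}.hook_resid_pre", inject)]):
    _, cache = model.run_with_cache(batch)

# Per-component contributions to the final residual stream at the last
# position (embeddings, each attention layer, each MLP layer).
resid_stack, labels = cache.decompose_resid(
    layer=-1, pos_slice=-1, return_labels=True
)
resid_stack = cache.apply_ln_to_stack(resid_stack, layer=-1, pos_slice=-1)

# Logit-diff direction between two task-relevant answer tokens (placeholders).
answer_tokens = torch.tensor(
    [model.to_single_token(" Paris"), model.to_single_token(" London")]
)
dirs = model.tokens_to_residual_directions(answer_tokens)
logit_dir = dirs[0] - dirs[1]

# One attribution profile over the coefficient sweep per component: plot each
# row of `attribution` against `coeffs`, then cluster rows (e.g. k-means).
attribution = einops.einsum(
    resid_stack, logit_dir, "comp batch d_model, d_model -> comp batch"
)
```

Swapping decompose_resid for get_full_resid_decomposition(expand_neurons=True) gives per-head or per-neuron profiles instead, at the cost of a much larger stack.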
So far, my results seem very interesting and possibly quite useful. It's possible this method is impractical in LLMs, but I think it might be fine there too. I'll DM some example figures.
I also want to investigate whether using a continuous injection coefficient in activation patching is similarly useful, since it seems like it might be; a rough sketch of what I mean follows.
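Roughly, the hard activation patch would be replaced by a linear interpolation between the corrupted and clean activations. A minimal sketch in TransformerLens, where the model, layer, and prompt pair are all placeholders (note the two prompts must tokenize to the same length):

```python
import torch
from functools import partial
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # placeholder model
hook_name = "blocks.6.hook_resid_pre"              # placeholder layer

# Placeholder prompt pair differing only in single-token names.
clean_tokens = model.to_tokens("When John and Mary went out, John gave the bag to")
corrupted_tokens = model.to_tokens("When John and Mary went out, Mary gave the bag to")

_, clean_cache = model.run_with_cache(clean_tokens)

def interpolate_patch(resid_pre, hook, clean_resid, alpha):
    # alpha = 0 reproduces the corrupted run; alpha = 1 is a full hard patch.
    return (1 - alpha) * resid_pre + alpha * clean_resid.to(resid_pre.device)

for alpha in torch.linspace(0, 1, 16).tolist():
    patched_logits = model.run_with_hooks(
        corrupted_tokens,
        fwd_hooks=[(hook_name, partial(
            interpolate_patch,
            clean_resid=clean_cache[hook_name],
            alpha=alpha,
        ))],
    )
    # Compute your logit diff from patched_logits and plot it against alpha.
```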
I am very excited to see if this makes my analyses easier! Great work!
I don’t think I follow your procedure. Would you be willing to walk me through an example situation?
Sure. Let’s do it at EAG. :)