Really exciting! I added a version of AVEC to my interpretability tool for gridworld agents and am keen to explore it more. I really like that the injection coefficient is a scalar; this has enabled me to do what I call an "injection coefficient scan".
The procedure I'm using looks like this (a code sketch follows the list):
1. Repeat your input tokens, say, 128 times.
2. During the AVEC forward pass, apply the activation vector to each copy, sweeping the injection coefficient across 128 evenly spaced values between −10 and 10.
3. Decompose the resulting residual stream to whatever granularity you like (use decompose_resid or get_full_resid_decomposition, with or without expand_neurons).
4. Dot product the outputs with your logit direction of choice (I use a logit diff that is meaningful in my task).
5. Plot the resulting attribution against the injection coefficient for each component.
6. If you like, cluster the profiles to show which components contribute similar functions of the injection coefficient to your decision.
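For concreteness, here is a minimal TransformerLens sketch of the scan above, assuming a small LM in place of my gridworld setup. The model name, injection layer, prompt, answer tokens, and steering vector are all placeholders, not my actual configuration:

```python
import torch
import einops
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # placeholder model
layer = 6                                          # placeholder injection layer
steering_vec = torch.randn(model.cfg.d_model)      # placeholder AVEC vector

n_steps = 128
coeffs = torch.linspace(-10, 10, n_steps)

# One copy of the prompt per coefficient value.
tokens = model.to_tokens("The Eiffel Tower is located in")  # placeholder prompt
batch = tokens.repeat(n_steps, 1)

def inject(resid_pre, hook):
    # Add coeff_i * steering_vec at every position of batch element i.
    shift = coeffs[:, None, None] * steering_vec
    return resid_pre + shift.to(resid_pre.device)

with model.hooks(fwd_hooks=[(f"blocks.{layer}.hook_resid_pre", inject)]):
    _, cache = model.run_with_cache(batch)

# Per-component contributions to the final residual stream at the last
# position (embeddings, each attention layer, each MLP layer).
resid_stack, labels = cache.decompose_resid(
    layer=-1, pos_slice=-1, return_labels=True
)
resid_stack = cache.apply_ln_to_stack(resid_stack, layer=-1, pos_slice=-1)

# Logit-diff direction between two task-relevant answer tokens (placeholders).
answer_tokens = torch.tensor(
    [model.to_single_token(" Paris"), model.to_single_token(" London")]
)
dirs = model.tokens_to_residual_directions(answer_tokens)
logit_dir = dirs[0] - dirs[1]

# One attribution profile over the coefficient sweep per component: plot each
# row of `attribution` against `coeffs`, then cluster rows (e.g. k-means).
attribution = einops.einsum(
    resid_stack, logit_dir, "comp batch d_model, d_model -> comp batch"
)
```

Swapping decompose_resid for get_full_resid_decomposition(expand_neurons=True) gives per-head or per-neuron profiles instead, at the cost of a much larger stack.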
So far, my results seem very interesting and possibly quite useful. It's possible this method is impractical in LLMs, but I think it might be fine there too. I'll DM some example figures.
I also want to investigate whether using a continuous injection coefficient in activation patching is similarly useful, since it seems like it might be; a rough sketch of what I mean follows.
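Roughly, the hard activation patch would be replaced by a linear interpolation between the corrupted and clean activations. A minimal sketch in TransformerLens, where the model, layer, and prompt pair are all placeholders (note the two prompts must tokenize to the same length):

```python
import torch
from functools import partial
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # placeholder model
hook_name = "blocks.6.hook_resid_pre"              # placeholder layer

# Placeholder prompt pair differing only in single-token names.
clean_tokens = model.to_tokens("When John and Mary went out, John gave the bag to")
corrupted_tokens = model.to_tokens("When John and Mary went out, Mary gave the bag to")

_, clean_cache = model.run_with_cache(clean_tokens)

def interpolate_patch(resid_pre, hook, clean_resid, alpha):
    # alpha = 0 reproduces the corrupted run; alpha = 1 is a full hard patch.
    return (1 - alpha) * resid_pre + alpha * clean_resid.to(resid_pre.device)

for alpha in torch.linspace(0, 1, 16).tolist():
    patched_logits = model.run_with_hooks(
        corrupted_tokens,
        fwd_hooks=[(hook_name, partial(
            interpolate_patch,
            clean_resid=clean_cache[hook_name],
            alpha=alpha,
        ))],
    )
    # Compute your logit diff from patched_logits and plot it against alpha.
```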
I am very excited to see if this makes my analyses easier! Great work!
I don’t think I follow your procedure. Would you be willing to walk me through an example situation?
Sure. Let’s do it at EAG. :)