jsteinhardt comments on Steering GPT-2-XL by adding an activation vector

jsteinhardt May 19, 2023, 7:53 PM
LW: 65 AF: 33
Hi Alex,
Let me first acknowledge that your write-up is significantly more thorough than pretty much all content on LessWrong, and that I found the particular examples interesting. I also appreciated that you included a related work section in your write-up. The reason I commented on this post and not others is that it’s one of the few ML posts on LessWrong that seemed like it might teach me something, and I wish I had made that clearer before posting critical feedback (I was thinking of the feedback as directed at Oliver / Raemon’s moderation norms, rather than your work, but I realize in retrospect it probably felt directed at you).
I think the main point is that there is a body of related work in the ML literature that explores fairly similar ideas, that LessWrong readers who care about AI alignment should be aware of this work, and that most LessWrong readers who read the post won’t realize it exists. I think it’s good to point out Dan’s initial mistake, but I took his substantive point to be what I just summarized, and it seems correct to me and hasn’t been addressed. (I also think Dan overfocused on Ludwig’s paper; see below for more of my take on related work.)
Here is how I currently see the paper situated in broader work (I think you do discuss the majority but not all of this):
* There is a lot of work studying activation vectors in computer vision models, and the methods here seem broadly similar to the methods there. This seems like the closest point of comparison.
* In language, there’s a bunch of work on controllable generation (https://arxiv.org/pdf/2201.05337.pdf) where I would be surprised if no one looked at modifying activations (at least I’d expect someone to try soft prompt tuning), but I don’t know for sure.
* On modifying activations in language models there is a bunch of work on patching / swapping activations, and on modifying activations in the directions of probes.
I think we would probably both agree that this is the main set of related papers, and also both agree that you cited work within each of these branches (except maybe the second one). Where we differ is that I see all of this as basically variations on the same idea of modifying the activations or weights to control a model’s runtime behavior:
* You need to find a direction, which you can do either by learning a direction or by simple averaging. Simple averaging is more or less the same as one step of gradient descent, so I see these as conceptually similar.
* You can modify the activations or weights. Usually if an idea works in one case it works in the other case, so I also see these as similar.
* The modality can be language or vision. Most prior work has been on vision models, but some of that has also been on vision-language models, e.g. I’m pretty sure there’s a paper on averaging together CLIP activations to get controllable generation.
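One way to make the "simple averaging is more or less one step of gradient descent" point concrete is the sketch below. It rests on assumptions that are not in the comment itself: the learned direction is a logistic-regression probe, the probe starts at zero, and the two classes are the same size. The function names and the synthetic activations are hypothetical. Under those assumptions, the first gradient step is exactly a scaled copy of the difference of class means.

```python
import numpy as np

def mean_difference_direction(pos, neg):
    """Direction from simple averaging: mean(pos) - mean(neg)."""
    return pos.mean(axis=0) - neg.mean(axis=0)

def one_gradient_step_from_zero(pos, neg, lr=1.0):
    """One gradient-descent step on mean logistic loss, starting from w = 0.

    Assumption: a bias-free logistic-regression probe stands in for
    "learning a direction".
    """
    x = np.concatenate([pos, neg], axis=0)
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    w = np.zeros(x.shape[1])
    probs = 1.0 / (1.0 + np.exp(-(x @ w)))  # every entry is 0.5 at w = 0
    grad = ((probs - y)[:, None] * x).mean(axis=0)
    return w - lr * grad

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins for activations on two contrasting sets of prompts.
rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, size=(64, 16))
neg = rng.normal(loc=-1.0, size=(64, 16))

avg_direction = mean_difference_direction(pos, neg)
gd_direction = one_gradient_step_from_zero(pos, neg)

# With balanced classes, the step equals 0.25 * lr * (mean(pos) - mean(neg)).
print(cosine_similarity(avg_direction, gd_direction))
```

The exact match holds only for the first step from a zero initialization with balanced classes. Later steps, a different loss, or unbalanced classes would move the learned direction away from the plain average.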
So I think it’s most accurate to say that you’ve adapted some well-explored ideas to a use case that you are particularly interested in. However, the post uses language like “Activation additions are a new way of interacting with LLMs”, which seems to be claiming that this is entirely new and unexplored, and I think this could mislead readers, as for instance Thomas Kwa’s response seems to suggest.
I also felt like Dan H brought up reasonable questions (e.g. why should we believe that weights vs. activations is a big deal? Why is fine-tuning vs. averaging important? Have you tried testing the difference empirically?) that haven’t been answered, and it would be good to at least acknowledge them more clearly. What most bothered me about the exchange above was that he was raising points that seemed good to me and that were not being directly engaged with.
This is my best attempt to explain where I’m coming from in about an hour of work (spent e.g. reading through things and trying to articulate intuitions in LW-friendly terms). I don’t think it captures my full intuitions or the full reasons I bounced off the related work section, but hopefully it’s helpful.
- TurnTrout May 22, 2023, 2:09 PM
  LW: 11 AF: 5
  Thanks so much, I really appreciate this comment. I think it’ll end up improving this post/the upcoming paper.
  (I might reply later to specific points)
  - jsteinhardt May 27, 2023, 5:46 PM
    LW: 2 AF: 1
    Glad it was helpful!