Dan H comments on Steering GPT-2-XL by adding an activation vector

Dan H 14 May 2023 23:25 UTC
LW: 7 AF: 5
Yes, I’ll tend to write up comments quickly so that I don’t feel as inclined to get in detailed back-and-forths and use up time, but here we are. When I wrote it, I thought there were only 2 things mentioned in the related works until Daniel pointed out the formatting choice, and when I skimmed the post I didn’t easily see comparisons or discussion that I expected to see, hence I gestured at needing more detailed comparisons. After posting, I found a one-sentence comparison of the work I was looking for, so I edited to include that I found it, but it was oddly not emphasized. A more ideal comment would have been “It would be helpful to me if this work would more thoroughly compare to (apparently) very related works such as …”
- Raemon 18 May 2023 20:51 UTC
  LW: 2 AF: 5
  I’m also not able to evaluate the object-level of “was this post missing obvious stuff it’d have been good to improve”, but, something I want to note about my own guess of how an ideal process would go from my current perspective:
  I think it makes more sense to think of posting on LessWrong as “submitting to a journal” than as “publishing a finished paper.” So the part where some people then comment “hey, this is missing X” is more analogous to submitting to peer review and being told “hey, you missed X” than to publishing a finished paper in a journal and having it miss X.
  I do think a thing LessWrong is missing (or, doesn’t do a good enough job at) is a “here is the actually finished stuff”. I think the things that end up in the Best of LessWrong, after being subjected to review, are closer to that, but I think there’s room to improve that more, and/or have some kind of filter for stuff that’s optimized to meet academic-expectations-in-particular.
  - jsteinhardt 18 May 2023 23:24 UTC
    LW: 30 AF: 7
    I’ll just note that I, like Dan H, find it pretty hard to engage with this post because I can’t tell whether it’s basically the same as the Ludwig Schmidt paper (my current assumption is that it is). The paragraph the authors added didn’t really help in this regard.
    
    I’m not sure what you mean about whether the post was “missing something important”, but I do think that you should be pretty worried about LessWrong’s collective epistemics that Dan H is the only one bringing this important point up, and that rather than being rewarded for doing so or engaged with on his substantive point, he’s being nitpicked by a moderator. It’s not an accident that no one else is bringing these points up—it’s because everyone else who has the expertise to do so has given up or judged it not worth their time, largely because of responses like the one Dan H is getting.
    - TurnTrout 19 May 2023 12:54 UTC
      LW: 26 AF: 13
      > I, like Dan H, find it pretty hard to engage with this post because I can’t tell whether it’s basically the same as the Ludwig Schmidt paper (my current assumption is that it is). The paragraph the authors added didn’t really help in this regard.
      The answer is: No, our work is very different from that paper. Here’s the paragraph in question:
      > “Editing Models with Task Arithmetic” explored a “dual” version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. While this seems interesting, task arithmetic requires finetuning. In “Activation additions have advantages over (RL/supervised) finetuning”, we explain the advantages our approach may have over finetuning.
      Here’s one possible improvement:
      > “Editing Models with Task Arithmetic” explored a “dual” version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. Our approach does not modify the weights. Instead, we modify forward passes by adding an activation vector. While their task arithmetic paper seems interesting, task arithmetic requires finetuning. In “Activation additions have advantages over (RL/supervised) finetuning”, we explain the advantages our approach may have over finetuning.
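The contrast drawn in this paragraph can be made concrete with a toy sketch. This is an illustration only, not the post's or the task arithmetic paper's code: the two-layer model, the weight values, the "finetuned" matrix `W1_ft`, and the helper names (`task_vector`, `apply_task_vector`, `forward`) are all hypothetical. It assumes the steering vector is a difference of hidden activations on two contrasting inputs, which is one plausible reading of "adding an activation vector".

```python
# Toy comparison: editing weights (task arithmetic) vs. editing a forward pass
# (activation addition). All names and numbers are hypothetical.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def hidden(W1, x):
    """Hidden activations of the toy model."""
    return relu(matvec(W1, x))

def forward(W1, W2, x, steer=None, coeff=1.0):
    """Run the toy model; optionally add coeff * steer to the hidden layer."""
    h = hidden(W1, x)
    if steer is not None:
        h = [hi + coeff * si for hi, si in zip(h, steer)]
    return matvec(W2, h)

def task_vector(W_pre, W_ft):
    """Weight-space difference: finetuned weights minus pretrained weights."""
    return [[f - p for p, f in zip(rp, rf)] for rp, rf in zip(W_pre, W_ft)]

def apply_task_vector(W, tau, lam):
    """Return new weights W + lam * tau (the input W is not mutated)."""
    return [[w + lam * t for w, t in zip(rw, rt)] for rw, rt in zip(W, tau)]

W1 = [[1.0, 0.0], [0.0, 1.0]]       # "pretrained" first layer
W2 = [[1.0, -1.0]]                  # readout layer
W1_ft = [[1.5, 0.0], [0.0, 0.5]]    # stand-in for a finetuned first layer

x = [1.0, 2.0]
base = forward(W1, W2, x)

# Weight edit: requires a finetuned checkpoint to compute the difference.
tau = task_vector(W1, W1_ft)
W1_edited = apply_task_vector(W1, tau, lam=1.0)
edited = forward(W1_edited, W2, x)

# Activation edit: needs only forward passes on two contrasting inputs.
x_pos, x_neg = [1.0, 0.0], [0.0, 1.0]
steer = [p - n for p, n in zip(hidden(W1, x_pos), hidden(W1, x_neg))]
steered = forward(W1, W2, x, steer=steer, coeff=2.0)

print(base, edited, steered)
```

In this sketch the weight edit produces a new parameter set, while the activation edit leaves `W1` and `W2` untouched and changes only one intermediate value during the forward pass.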
      Does this clarify the situation, or did you find the paragraph unclear for other reasons?
      > I do think that you should be pretty worried about LessWrong’s collective epistemics that Dan H is the only one bringing this important point up, and that rather than being rewarded for doing so or engaged with on his substantive point, he’s being nitpicked by a moderator.
      Can you be specific about what the “important point” is? Is it the potential ambiguity of this post’s explanation of the relevance of the task vector paper? Is it something else?
      Setting aside the matter of the task vector work, I want to also point out that Dan’s original comment was predicated on an understandable misreading of our post. Due to a now-removed piece of ambiguous formatting, he originally thought that our literature review only cited 2 papers. He criticized this post for not citing 10+ references, when in fact it does cite 14 papers in the related work appendix (although some of them were in footnotes, now integrated into the body of the literature review). I don’t consider Habryka pointing that out to be a nitpick, especially since it improved Dan’s understanding of the post.
      I will also note that I (ironically, from your point of view) wrote the prior work section while thinking about your critique that LessWrong posts often have unclear or nonexistent prior work sections. For my part, I think the prior work section is thorough and that it clearly contrasts our work with past work, when appropriate. Of course, I would be grateful for specific feedback on what you found unclear or ambiguous. (This, too, is a standard part of the peer review process.)
      EDIT: After writing this comment, I added subsection headings to the prior work appendix, among other minor edits.
      - jsteinhardt 19 May 2023 19:53 UTC
        LW: 65 AF: 33
        Hi Alex,
        Let me first acknowledge that your write-up is significantly more thorough than pretty much all content on LessWrong, and that I found the particular examples interesting. I also appreciated that you included a related work section in your write-up. The reason I commented on this post and not others is that it’s one of the few ML posts on LessWrong that seemed like it might teach me something, and I wish I had made that clearer before posting critical feedback (I was thinking of the feedback as directed at Oliver / Raemon’s moderation norms, rather than your work, but I realize in retrospect it probably felt directed at you).
        I think the main important point is that there is a body of related work in the ML literature that explores fairly similar ideas, and LessWrong readers who care about AI alignment should be aware of this work, and that most LessWrong readers who read the post won’t realize this. I think it’s good to point out Dan’s initial mistake, but I took his substantive point to be what I just summarized, and it seems correct to me and hasn’t been addressed. (I also think Dan overfocused on Ludwig’s paper, see below for more of my take on related work.)
        Here is how I currently see the paper situated in broader work (I think you do discuss the majority but not all of this):
        * There is a lot of work studying activation vectors in computer vision models, and the methods here seem broadly similar to the methods there. This seems like the closest point of comparison.
        * In language, there’s a bunch of work on controllable generation (https://arxiv.org/pdf/2201.05337.pdf) where I would be surprised if no one looked at modifying activations (at least I’d expect someone to try soft prompt tuning), but I don’t know for sure.
        * On modifying activations in language models there is a bunch of stuff on patching / swapping, and on modifying stuff in the directions of probes.
        I think we would probably both agree that this is the main set of related papers, and also both agree that you cited work within each of these branches (except maybe the second one). Where we differ is that I see all of this as basically variations on the same idea of modifying the activations or weights to control a model’s runtime behavior:
        * You need to find a direction, which you can do either by learning a direction or by simple averaging. Simple averaging is more or less the same as one step of gradient descent, so I see these as conceptually similar.
        * You can modify the activations or weights. Usually if an idea works in one case it works in the other case, so I also see these as similar.
        * The modality can be language or vision. Most prior work has been on vision models, but some of that has also been on vision-language models, e.g. I’m pretty sure there’s a paper on averaging together CLIP activations to get controllable generation.
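The claim in the first bullet, that simple averaging is close to one step of gradient descent, can be checked on a toy case. The sketch below is an illustration under stated assumptions rather than a construction from any cited paper: it assumes a bias-free logistic probe initialized at zero, balanced classes, and a learning rate of 4. The data and the names `mean_difference` and `logistic_grad` are hypothetical.

```python
import math

# Hypothetical activations for a "positive" and a "negative" class (balanced).
pos = [[2.0, 0.0], [4.0, 2.0]]
neg = [[0.0, 2.0], [0.0, 0.0]]

def mean(vectors):
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def mean_difference(pos, neg):
    """Direction found by simple averaging: mean(pos) - mean(neg)."""
    return [p - q for p, q in zip(mean(pos), mean(neg))]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad(w, pos, neg):
    """Gradient of the mean logistic loss for a bias-free linear probe."""
    data = [(h, 1.0) for h in pos] + [(h, 0.0) for h in neg]
    grad = [0.0] * len(w)
    for h, y in data:
        p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)))
        for j, hj in enumerate(h):
            grad[j] += (p - y) * hj / len(data)
    return grad

# One gradient-descent step from w = 0 with learning rate 4.
w0 = [0.0, 0.0]
lr = 4.0
g = logistic_grad(w0, pos, neg)
w1 = [wi - lr * gi for wi, gi in zip(w0, g)]

avg_direction = mean_difference(pos, neg)
print(w1, avg_direction)
```

At `w = 0` every prediction is 0.5, so the gradient is a quarter of the negated mean difference when the classes are balanced. One step with learning rate 4 therefore lands exactly on the averaged direction. With unbalanced classes, a bias term, or a nonzero starting point, the two directions only agree approximately, which fits the "more or less" in the original statement.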
        So I think it’s most accurate to say that you’ve adapted some well-explored ideas to a use case that you are particularly interested in. However, the post uses language like “Activation additions are a new way of interacting with LLMs”, which seems to be claiming that this is entirely new and unexplored, and I think this could mislead readers, as for instance Thomas Kwa’s response seems to suggest.
        I also felt like Dan H brought up reasonable questions (e.g. why should we believe that weights vs. activations is a big deal? Why is fine-tuning vs. averaging important? Have you tried testing the difference empirically?) that haven’t been answered, and that it would be good to at least acknowledge more clearly. What most bothered me about the exchange above was that he was bringing up points that seemed good to me and that were not being directly engaged with.
        This is my best attempt to explain where I’m coming from in about an hour of work (spent e.g. reading through things and trying to articulate intuitions in LW-friendly terms). I don’t think it captures my full intuitions or the full reasons I bounced off the related work section, but hopefully it’s helpful.
        - TurnTrout 22 May 2023 14:09 UTC
        LW: 11 AF: 5
        Thanks so much, I really appreciate this comment. I think it’ll end up improving this post/the upcoming paper.
        (I might reply later to specific points)
        - jsteinhardt 27 May 2023 17:46 UTC
        LW: 2 AF: 1
        Glad it was helpful!
      - awg 19 May 2023 15:50 UTC
        4 points
        I think this entire thread shows why it’s kind of silly to hold LessWrong posts to the same standard as a peer reviewed journal submission. There is clearly a lower bar to LessWrong posts than getting into a peer reviewed journal or even arXiv. And that’s fine, that’s as it should be. This is a forum, not a journal.
        That said, I also think this entire thread shows why LessWrong is a very valuable forum, due to its users’ high epistemic standards.
        It’s a balance.