Could these sorts of posts have more thorough related works sections? It’s usually standard for related works in empirical papers to mention 10+ works. Update: I was looking for a discussion of https://arxiv.org/abs/2212.04089, assumed it wasn’t included in this post, and many minutes later finally found a brief sentence about it in a footnote.
Maybe also [1607.06520] Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings is relevant as early (2016) work concerning embedding arithmetic.
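For readers who haven't seen that line of work: the operation there is plain vector arithmetic on word embeddings. A toy sketch with made-up 4-d vectors (real models like word2vec learn ~300-d vectors from large corpora; nothing here is taken from an actual embedding model):

```python
import numpy as np

# Hypothetical 4-d embeddings standing in for learned word vectors.
emb = {
    "man":   np.array([0.9, 0.1, 0.4, 0.2]),
    "woman": np.array([0.9, 0.1, 0.4, 0.8]),
    "king":  np.array([0.2, 0.9, 0.5, 0.2]),
    "queen": np.array([0.2, 0.9, 0.5, 0.8]),
}

def nearest(vec, vocab):
    """Return the vocabulary word whose embedding has the highest cosine similarity to vec."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(vocab, key=lambda w: cos(vec, vocab[w]))

# "king - man + woman" lands nearest to "queen" in this toy space,
# the same arithmetic the debiasing paper builds on.
print(nearest(emb["king"] - emb["man"] + emb["woman"], emb))  # queen
```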
I don’t understand this comment. I did a quick count of related works mentioned in the “Related Works” section (and the footnotes of that section) and got around 10 works, so it seems like this meets your pretty arbitrarily established bar. There are also lots of footnotes and references to related work sprinkled all over the post, which seems like the better place to discuss related work anyway.
I am not familiar enough with the literature to know whether this post is omitting any crucial pieces of related work, but the relevant section of this post seems totally adequate in terms of volume (and also the comments are generally a good place for people to drop links to related work, if they think there is interesting related work missing).
Also, linking to a related work in a footnote seems totally fine. It is somewhat sad that link-text isn’t searchable by default, so searching for the relevant arxiv link is harder than it has to be. Might make sense to add some kind of tech solution here.
Background for people who understandably don’t habitually read full empirical papers:
Related Works sections in empirical papers collect many comparisons in one coherent place. This contextualizes the work and helps busy readers quickly judge whether it is meaningfully novel relative to the literature, which also requires giving a good account of that literature. It helps us more easily understand how much of an advance the work is. I’ve seen a good number of papers steering with latent arithmetic in the past year, but I would not be surprised if this is the first time many readers of AF/LW have seen the idea, which would make this post seem especially novel to them. A good related works section would more accurately and quickly communicate how novel the contribution is. I don’t think this norm is gatekeeping or pedantic; it becomes essential when the number of papers is high.
The total number of papers cited throughout a paper is different from the number of papers discussed in the related works section. If a relevant paper is buried somewhere random in the paper and not explicitly contrasted with in the related works section, that is usually penalized in peer review.
I think you might be interpreting the break after the sentence “Their results are further evidence for feature linearity and internal activation robustness in these models.” as the end of the related work section? I’m not sure why that break is there, but the section continues with them citing Mikolov et al (2013), Larsen et al (2015), White (2016), Radford et al (2016), and Upchurch et al (2016) in the main text, as well as a few more papers in footnotes.
Yes, I was; good catch. Both earlier and now, the unusual formatting and the nonstandard related works section caused confusion. Even so, the work after the break is much older. The comparison to works such as https://arxiv.org/abs/2212.04089 is not in the related works section and gets a single sentence in a footnote: “That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-diff vectors.”
Is this a big difference? I really don’t know; it’d be helpful if they contrasted more. Is this work very novel and useful, while that one isn’t any good for alignment? Or did Ludwig Schmidt (not x-risk pilled) and coauthors in Editing Models with Task Arithmetic (made public last year and already published) come up with an idea similar to what a close observer called “the most impressive concrete achievement in alignment I’ve seen”? If so, what does that say about the need to be x-risk motivated to do relevant research, and what does it say about group epistemics and our ability to spot relevant progress when it isn’t posted on the AF?
On the object-level, deriving task vectors in weight-space from deltas in fine-tuned checkpoints is really different from what was done here, because it requires doing a lot of backward passes on a lot of data. Deriving task vectors in activation-space, as done in this new work, requires only a single forward pass on a truly tiny amount of data. So the data-efficiency and compute-efficiency of the steering power gained with this new method are orders of magnitude better, in my view.
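To make the cost contrast concrete, here is a rough sketch of the activation-space version (the model, layer index, prompt pair, and coefficient are illustrative placeholders rather than the post's actual settings, and for simplicity the vector is added at every position rather than only at the front tokens):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
LAYER, COEFF = 6, 4.0  # illustrative choices

@torch.no_grad()
def resid_before_layer(text):
    """One forward pass; return the residual stream entering block LAYER (last position)."""
    ids = tok(text, return_tensors="pt").input_ids
    hidden_states = model(ids, output_hidden_states=True).hidden_states
    return hidden_states[LAYER][0, -1]  # shape: (d_model,)

# Steering vector from a single pair of short prompts: no gradients, no finetuning.
steer = resid_before_layer(" Love") - resid_before_layer(" Hate")

def add_steer(module, args):
    """Forward pre-hook: add the steering vector to block LAYER's input hidden states."""
    return (args[0] + COEFF * steer,) + args[1:]

handle = model.transformer.h[LAYER].register_forward_pre_hook(add_steer)
ids = tok("I think dogs are", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=20, do_sample=False)[0]))
handle.remove()
```

Whether this matches finetuned task vectors on steering quality is a separate question; the sketch is only meant to show the cost profile.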
Also, taking affine combinations in weight-space is not novel to Schmidt et al either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabilities from models.
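For reference, the weight-space operation being discussed is just elementwise arithmetic over checkpoints. A minimal sketch with toy stand-ins for the checkpoints (no claim that this reproduces any particular paper's or community's exact recipe):

```python
import copy
import torch
import torch.nn as nn

# Toy stand-ins for a base checkpoint and a finetuned checkpoint of the same architecture.
base_model = nn.Linear(16, 16)
finetuned_model = copy.deepcopy(base_model)
with torch.no_grad():
    for p in finetuned_model.parameters():
        p.add_(0.01 * torch.randn_like(p))  # pretend this delta came from task finetuning

base, ft = base_model.state_dict(), finetuned_model.state_dict()

# "Task vector" = weight difference; alpha > 0 adds the task behaviour, alpha < 0 negates it.
# Affine mixes of several such vectors are the same trick the model-merging folks use.
alpha = 1.0
edited = {k: base[k] + alpha * (ft[k] - base[k]) for k in base}

edited_model = copy.deepcopy(base_model)
edited_model.load_state_dict(edited)
```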
It’s a good observation that it’s more efficient; does it trade off performance? (These sorts of comparisons would probably be demanded if it were submitted to any other truth-seeking ML venue, and I apologize for consistently being the person applying the pressures that generic academics would provide. It would be nice if the authors would provide these comparisons.)
It takes months to write up these works, and since the Schmidt paper appeared in December, it is not obvious who was first in every sense. The usual standard is to count when a standard-sized paper first appeared on arXiv, so in the most standard sense they are first. (Inside conferences, a paper only counts as prior art if it was previously published, not merely arXived, but outside of conferences most people just keep track of when it was arXived.) Otherwise there are arms-race dynamics leading to everyone spamming snippets before doing careful, extensive science.
Some direct quantitative comparison between activation-steering and task-vector-steering (at, say, reducing toxicity) is indeed a very sensible experiment for a peer reviewer to ask for and I would like to see it as well.
The level of comparison between the present post and that paper seems about the same as I see in papers you have been a co-author on.
E.g. in https://arxiv.org/pdf/2304.03279.pdf the Related Works section is basically just a list of papers, with maybe half a sentence describing their relation to the paper. This seems normal and fine, and I don’t see even papers you are a co-author on doing something substantively different here (this is again separate from whether there are any important papers omitted from the list of related works, or whether any specific comparisons are inaccurate; it’s just a claim about the usual level of detail that related works sections tend to go into).
In many of my papers, there aren’t fairly similar works (I strongly prefer to work in areas before they’re popular), so there’s a lower expectation for comparison depth, though breadth is always standard. In other works of mine, such as this paper on learning the right thing in the presence of extremely bad supervision/extremely bad training objectives, we contrast with the two main related works for two paragraphs, and compare to these two methods for around half of the entire paper.
The extent of an adequate comparison depends on the relatedness. I’m of course not saying every paper in the related works needs its own paragraph. But if the approaches are fairly similar, usually there need to be empirical juxtapositions as well. If the difference between these papers is “we do activations, they do weights,” then I think that warrants a more in-depth conceptual comparison or, preferably, many empirical comparisons.
Yeah, it’s totally possible that, as I said, there is a specific other paper that is important to mention or where the existing comparison seems inaccurate. That seems quite different from a generic “please have more thorough related work sections” request like the one you make in the top-level comment (which I’m guessing was mostly based on your misreading of the post, i.e. thinking the related work section only spans two paragraphs).
Yes, I tend to write up comments quickly so that I don’t feel as inclined to get into detailed back-and-forths and use up time, but here we are. When I wrote it, I thought there were only 2 works mentioned in the related works section until Daniel pointed out the formatting choice, and when I skimmed the post I didn’t easily see the comparisons or discussion I expected to see, hence I gestured at needing more detailed comparisons. After posting, I found a one-sentence comparison to the work I was looking for, so I edited my comment to note that I found it, but it was oddly not emphasized. A more ideal comment would have been “It would be helpful to me if this work would more thoroughly compare to (apparently) very related works such as …”
I’m also not able to evaluate the object-level question of “was this post missing obvious stuff it’d have been good to include,” but I want to note something about my own guess of how an ideal process would go, from my current perspective:
I think it makes more sense to think of posting on LessWrong as “submitting to a journal” than as “publishing a finished paper.” So the part where some people then comment “hey, this is missing X” is more analogous to submitting to peer review and being told “hey, you missed X” than to publishing a finished paper in a journal that turns out to be missing X.
I do think a thing LessWrong is missing (or, doesn’t do a good enough job at) is a “here is the actually finished stuff”. I think the things that end up in the Best of LessWrong, after being subjected to review, are closer to that, but I think there’s room to improve that more, and/or have some kind of filter for stuff that’s optimized to meet academic-expectations-in-particular.
I’ll just note that I, like Dan H, find it pretty hard to engage with this post because I can’t tell whether it’s basically the same as the Ludwig Schmidt paper (my current assumption is that it is). The paragraph the authors added didn’t really help in this regard.
I’m not sure what you mean about whether the post was “missing something important”, but I do think that you should be pretty worried about LessWrong’s collective epistemics that Dan H is the only one bringing this important point up, and that rather than being rewarded for doing so or engaged with on his substantive point, he’s being nitpicked by a moderator. It’s not an accident that no one else is bringing these points up—it’s because everyone else who has the expertise to do so has given up or judged it not worth their time, largely because of responses like the one Dan H is getting.
The answer is: No, our work is very different from that paper. Here’s the paragraph in question:

Editing Models with Task Arithmetic explored a “dual” version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. While this seems interesting, task arithmetic requires finetuning. In Activation additions have advantages over (RL/supervised) finetuning, we explain the advantages our approach may have over finetuning.
Here’s one possible improvement:

Editing Models with Task Arithmetic explored a “dual” version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. Our approach does not modify the weights. Instead, we modify forward passes by adding an activation vector. While their task arithmetic paper seems interesting, task arithmetic requires finetuning. In Activation additions have advantages over (RL/supervised) finetuning, we explain the advantages our approach may have over finetuning.
Does this clarify the situation, or did you find the paragraph unclear for other reasons?
Can you be specific about what the “important point” is? Is it the potential ambiguity of this post’s explanation of the relevance of the task vector paper? Is it something else?
Setting aside the matter of the task vector work, I want to also point out that Dan’s original comment was predicated on an understandable misreading of our post. Due to a now-removed piece of ambiguous formatting, he originally thought that our literature review only cited 2 papers. He criticized this post for not citing 10+ references, when in fact it does cite 14 papers in the related work appendix (although some of them were in footnotes, now integrated into the body of the literature review). I don’t consider Habryka pointing that out to be a nitpick, especially since it improved Dan’s understanding of the post.
I will also note that I (ironically, from your point of view) wrote the prior work section while thinking about your critique that LessWrong posts often have unclear or nonexistent prior work sections. For my part, I think the prior work section is thorough and that it clearly contrasts our work with past work, when appropriate. Of course, I would be grateful for specific feedback on what you found unclear or ambiguous. (This, too, is a standard part of the peer review process.)
EDIT: After writing this comment, I added subsection headings to the prior work appendix, among other minor edits.
Hi Alex,
Let me first acknowledge that your write-up is significantly more thorough than pretty much all content on LessWrong, and that I found the particular examples interesting. I also appreciated that you included a related work section in your write-up. The reason I commented on this post and not others is because it’s one of the few ML posts on LessWrong that seemed like it might teach me something, and I wish I had made that more clear before posting critical feedback (I was thinking of the feedback as directed at Oliver / Raemon’s moderation norms, rather than your work, but I realize in retrospect it probably felt directed at you).
I think the main important point is that there is a body of related work in the ML literature that explores fairly similar ideas, and LessWrong readers who care about AI alignment should be aware of this work, and that most LessWrong readers who read the post won’t realize this. I think it’s good to point out Dan’s initial mistake, but I took his substantive point to be what I just summarized, and it seems correct to me and hasn’t been addressed. (I also think Dan overfocused on Ludwig’s paper, see below for more of my take on related work.)
Here is how I currently see the paper situated in broader work (I think you do discuss the majority but not all of this):
* There is a lot of work studying activation vectors in computer vision models, and the methods here seem broadly similar to the methods there. This seems like the closest point of comparison.
* In language, there’s a bunch of work on controllable generation (https://arxiv.org/pdf/2201.05337.pdf) where I would be surprised if no one looked at modifying activations (at least I’d expect someone to try soft prompt tuning, sketched just after this list), but I don’t know for sure.
* On modifying activations in language models there is a bunch of stuff on patching / swapping, and on modifying stuff in the directions of probes.
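To unpack the soft-prompt-tuning idea from the second bullet, here is a rough sketch (the number of virtual tokens, learning rate, and training text are arbitrary illustrative choices, not any particular paper's recipe):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
model.requires_grad_(False)  # the model's weights stay frozen; only the soft prompt trains

n_virtual = 8  # arbitrary number of "virtual tokens"
soft_prompt = torch.nn.Parameter(model.transformer.wte.weight[:n_virtual].detach().clone())
opt = torch.optim.Adam([soft_prompt], lr=1e-3)

def step(text):
    """One optimization step: prepend learned embeddings and maximize likelihood of text."""
    ids = tok(text, return_tensors="pt").input_ids
    embeds = torch.cat([soft_prompt.unsqueeze(0), model.transformer.wte(ids)], dim=1)
    labels = torch.cat([torch.full((1, n_virtual), -100), ids], dim=1)  # -100 = ignore index
    loss = model(inputs_embeds=embeds, labels=labels).loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Training on text with the desired style steers generation without touching the weights;
# what changes is the activations fed into the first layer.
print(step("I love dogs. Dogs are wonderful companions."))
```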
I think we would probably both agree that this is the main set of related papers, and also both agree that you cited work within each of these branches (except maybe the second one). Where we differ is that I see all of this as basically variations on the same idea of modifying the activations or weights to control a model’s runtime behavior:
* You need to find a direction, which you can do either by learning a direction or by simple averaging. Simple averaging is more or less the same as one step of gradient descent (see the toy check after this list), so I see these as conceptually similar.
* You can modify the activations or weights. Usually if an idea works in one case it works in the other case, so I also see these as similar.
* The modality can be language or vision. Most prior work has been on vision models, but some of that has also been on vision-language models, e.g. I’m pretty sure there’s a paper on averaging together CLIP activations to get controllable generation.
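On the first bullet, a synthetic check of the averaging-vs-gradient-step claim (the "activations" below are made up; with a zero-initialized logistic probe, the two directions come out proportional):

```python
import torch

torch.manual_seed(0)
d = 64
# Synthetic "activations" for two classes of prompts, separated along a hidden direction.
true_dir = torch.randn(d); true_dir /= true_dir.norm()
pos = torch.randn(200, d) + 2.0 * true_dir
neg = torch.randn(200, d) - 2.0 * true_dir

# Direction 1: simple difference of class means (what steering-vector methods use).
mean_diff = pos.mean(0) - neg.mean(0)

# Direction 2: one gradient-descent step of a logistic-regression probe from zero init.
w = torch.zeros(d, requires_grad=True)
x = torch.cat([pos, neg]); y = torch.cat([torch.ones(200), torch.zeros(200)])
loss = torch.nn.functional.binary_cross_entropy_with_logits(x @ w, y)
loss.backward()
one_step_dir = -w.grad

print(torch.nn.functional.cosine_similarity(mean_diff, one_step_dir, dim=0).item())  # ~1.0
```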
So I think it’s most accurate to say that you’ve adapted some well-explored ideas to a use case that you are particularly interested in. However, the post uses language like “Activation additions are a new way of interacting with LLMs”, which seems to be claiming that this is entirely new and unexplored, and I think this could mislead readers, as for instance Thomas Kwa’s response seems to suggest.
I also felt like Dan H brought up reasonable questions (e.g. why should we believe that weights vs. activations is a big deal? Why is fine-tuning vs. averaging important? Have you tried testing the difference empirically?) that haven’t been answered and that it would be good to at least acknowledge more clearly. The fact that he was bringing up points that seemed good to me and that were not being directly engaged with was what most bothered me about the exchange above.
This is my best attempt to explain where I’m coming from in about an hour of work (spent e.g. reading through things and trying to articulate intuitions in LW-friendly terms). I don’t think it captures my full intuitions or the full reasons I bounced off the related work section, but hopefully it’s helpful.
Thanks so much, I really appreciate this comment. I think it’ll end up improving this post/the upcoming paper.
(I might reply later to specific points)
Glad it was helpful!
I think this entire thread shows why it’s kind of silly to hold LessWrong posts to the same standard as a peer reviewed journal submission. There is clearly a lower bar to LessWrong posts than getting into a peer reviewed journal or even arXiv. And that’s fine, that’s as it should be. This is a forum, not a journal.
That said, I also think this entire thread shows why LessWrong is a very valuable forum, due to its users’ high epistemic standards.
It’s a balance.
Thanks for the feedback. Some related work was “hidden” in footnotes because, in an earlier version of the post, the related work was in the body and I wanted to decrease the time it took a reader to get to our results. The related work section is now basically consolidated into the appendix.
I also added another paragraph:

Lastly, Editing Models with Task Arithmetic explored a “dual” version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. While this seems interesting, task arithmetic requires finetuning. In Activation additions have advantages over (RL/supervised) finetuning, we explain the advantages our approach has over finetuning.
The (overlapping) evidence from Deep learning models might be secretly (almost) linear could also be useful/relevant, as could these two papers on ‘semantic differentials’ and (contextual) word embeddings: SensePOLAR: Word sense aware interpretability for pre-trained contextual word embeddings and Semantic projection recovers rich human knowledge of multiple object features from word embeddings.