iceman comments on Steering GPT-2-XL by adding an activation vector

iceman 14 May 2023 3:19 UTC
LW: 23 AF: 6
Redwood Research used to have a project about trying to prevent a model from outputting text where a human got hurt, which, IIRC, they did primarily through fine-tuning and adversarial training. (Followup). It would be interesting to see if one could achieve better results than they did at the time by subtracting some sort of hurt/violence vector.
- Dan H 14 May 2023 23:57 UTC
  LW: 12 AF: 7
  Page 4 of this paper compares negative vectors with fine-tuning for reducing toxic text: https://arxiv.org/pdf/2212.04089.pdf#page=4
  In Table 3, they show that in some cases task vectors can improve fine-tuned models.
  - TurnTrout 15 May 2023 15:15 UTC
    LW: 16 AF: 8
    Insofar as you mean to imply that “negative vectors” are obviously comparable to our technique, I disagree. Those are not activation additions, and I would guess they are not particularly similar to our approach. These “task vectors” involve subtracting weight vectors, not activation vectors. See also footnote 39 (EDIT: and the related work appendix now talks about this directly).
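    To make the weight-versus-activation distinction concrete, here is a minimal toy sketch in plain Python. It is not code from either the post or the task-vector paper: the single linear layer, the helper names (`matvec`, `task_vector_edit`, `activation_addition`), and all numbers are hypothetical and chosen only for illustration. The first edit changes the model's weights; the second leaves the weights alone and shifts an intermediate activation at inference time.

```python
def matvec(W, x):
    """Apply a toy linear layer: returns W @ x for list-of-lists W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def task_vector_edit(W_pre, W_ft, scale=-1.0):
    """Weight-space edit (assumed form): W_pre + scale * (W_ft - W_pre).

    With scale=-1.0 this subtracts the task vector, producing NEW weights.
    """
    return [
        [p + scale * (f - p) for p, f in zip(row_pre, row_ft)]
        for row_pre, row_ft in zip(W_pre, W_ft)
    ]


def activation_addition(W, x, x_plus, x_minus, coeff=1.0):
    """Activation-space edit: the weights W are never modified.

    A steering vector is built from the activation difference of two
    contrasting inputs and added to the activation of the actual input.
    """
    h = matvec(W, x)
    steer = [a - b for a, b in zip(matvec(W, x_plus), matvec(W, x_minus))]
    return [hi + coeff * s for hi, s in zip(h, steer)]


# Hypothetical toy weights: "pretrained" and "fine-tuned on an unwanted task".
W_pre = [[1.0, 0.0], [0.0, 1.0]]
W_ft = [[1.5, 0.0], [0.0, 1.0]]

# Negating the task vector yields a different weight matrix.
W_neg = task_vector_edit(W_pre, W_ft)

# Activation addition reuses W_pre unchanged and only shifts the activation.
steered = activation_addition(W_pre, [1.0, 1.0], [1.0, 0.0], [0.0, 1.0], coeff=2.0)
```

    In this toy setup, `W_neg` differs from `W_pre`, so every later forward pass is affected, whereas `activation_addition` returns a shifted activation for one forward pass and leaves `W_pre` as it was.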