TurnTrout comments on TurnTrout’s shortform feed

TurnTrout 21 Jun 2024 1:43 UTC
LW: 22 AF: 16
The authors updated the Scaling Monosemanticity paper. Relevant updates include:
1. In the intro, they added:
“Features can be used to steer large models (see e.g. Influence on Behavior). This extends prior work on steering models using other methods (see Related Work).”
2. The related work section now credits the rich history behind steering vectors / activation engineering, including not just my team’s work on activation additions, but also older literature in VAEs and GANs. (EDIT: Apparently this was always there? Maybe I misremembered the diff.)
3. The comparison results are now in an appendix and are much more hedged, noting that they didn’t properly evaluate against a steering vector baseline.
While it would have been better to have done this the first time, I really appreciate the team updating the paper to more clearly credit past work. :)
- Neel Nanda 21 Jun 2024 13:44 UTC
  LW: 5 AF: 5
  Oh, that’s great! Kudos to the authors for setting the record straight. I’m glad your work is now appropriately credited.