Sheikh Abdur Raheem Ali comments on [Full Post] Progress Update #1 from the GDM Mech Interp Team

Sheikh Abdur Raheem Ali 19 Apr 2024 21:30 UTC
1 point
If you wanted to inject the steering vector into multiple layers, would you need to train an SAE for each layer’s residual stream states?
- Arthur Conmy 21 Apr 2024 16:22 UTC
  2 points
  Yes, pretty much.
  There’s some work on transferring steering vectors: e.g. the Llama-2 steering paper (https://arxiv.org/abs/2312.06681) shows that you can transfer steering vectors from the base model to the chat model, and I once saw results at a hackathon suggesting you can train residual stream SAEs on early layers and transfer them to some later layers, too. But retraining is likely what our follow-up work will do (this post only used two different SAEs).
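
To make the "one SAE per layer" point concrete, here is a rough sketch of multi-layer steering. It is not the post's implementation. It assumes that steering means adding a scaled, unit-norm SAE decoder direction to the residual stream, and that each steered layer has its own separately trained SAE. The function name, the dict layout, and the `scale` value are all hypothetical, and toy random arrays stand in for real activations. In an actual model the addition would happen inside a forward hook so that later layers see the change.

```python
import numpy as np


def steer_multiple_layers(resid_by_layer, decoder_by_layer, feature_by_layer, scale=4.0):
    """Add a scaled SAE decoder direction to the residual stream at each layer that has an SAE.

    resid_by_layer:   {layer: array of shape (seq_len, d_model)}
    decoder_by_layer: {layer: array of shape (n_features, d_model)}, one SAE per steered layer
    feature_by_layer: {layer: feature index in that layer's SAE}; indices are per-SAE,
                      since separately trained SAEs do not share a feature ordering
    """
    steered = {}
    for layer, resid in resid_by_layer.items():
        if layer in decoder_by_layer:
            direction = decoder_by_layer[layer][feature_by_layer[layer]]
            direction = direction / np.linalg.norm(direction)  # unit norm, so scale sets the size
            steered[layer] = resid + scale * direction  # broadcasts over sequence positions
        else:
            steered[layer] = resid.copy()  # no SAE for this layer: leave it unsteered
    return steered


# Toy data (hypothetical sizes): 4 layers, SAEs available only for layers 1 and 3.
rng = np.random.default_rng(0)
d_model, n_features, seq_len = 8, 16, 3
resid_by_layer = {layer: rng.normal(size=(seq_len, d_model)) for layer in range(4)}
decoder_by_layer = {layer: rng.normal(size=(n_features, d_model)) for layer in (1, 3)}
feature_by_layer = {1: 5, 3: 11}

steered = steer_multiple_layers(resid_by_layer, decoder_by_layer, feature_by_layer)
```

The per-layer `feature_by_layer` mapping is where the retraining cost shows up: without a transfer trick like the ones mentioned above, each steered layer needs its own SAE and its own choice of feature.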