Neel Nanda comments on Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

Neel Nanda 18 Nov 2024 22:00 UTC
LW: 4 AF: 4
It’s essentially training an SAE on the concatenation of the residual streams of the base model and the chat model. For each prompt, you run it through the base model to get a residual stream vector v_b and through the chat model to get a residual stream vector v_c, then concatenate these into a vector twice as long and train an SAE on that (with some minor additional details that I’m not getting into).
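A rough sketch of the description above, in plain NumPy so it runs anywhere. Everything beyond "concatenate v_b and v_c, then train an SAE" is an assumption made for illustration: the activations are random stand-ins rather than real model outputs, the SAE is a basic ReLU autoencoder with an L1 sparsity penalty trained by full-batch gradient descent, and all sizes and hyperparameters are hypothetical. It does not cover the additional details left out above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in activations (assumption): in practice v_b and v_c would be residual
# stream vectors read from the base and chat models on the same prompts.
n_tokens, d_model, d_sae = 256, 16, 64
v_b = rng.normal(size=(n_tokens, d_model))
v_c = v_b + 0.1 * rng.normal(size=(n_tokens, d_model))

# Concatenate to get one vector twice as long per token.
x = np.concatenate([v_b, v_c], axis=-1)  # shape (n_tokens, 2 * d_model)

# A basic ReLU SAE on the concatenated vector (architecture is an assumption).
W_enc = rng.normal(scale=0.1, size=(2 * d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, 2 * d_model))
b_dec = np.zeros(2 * d_model)

l1_coeff, lr, n_steps = 1e-3, 0.005, 300  # hypothetical hyperparameters
losses = []

for _ in range(n_steps):
    # Forward pass: encode, apply ReLU, decode.
    pre = x @ W_enc + b_enc
    f = np.maximum(pre, 0.0)
    x_hat = f @ W_dec + b_dec

    # Loss: mean squared reconstruction error plus L1 penalty on the latents.
    recon = np.mean(np.sum((x_hat - x) ** 2, axis=-1))
    sparsity = np.mean(np.sum(f, axis=-1))  # f >= 0, so this is the L1 norm
    losses.append(recon + l1_coeff * sparsity)

    # Backward pass, written out by hand.
    d_xhat = 2.0 * (x_hat - x) / n_tokens
    d_W_dec = f.T @ d_xhat
    d_b_dec = d_xhat.sum(axis=0)
    d_f = d_xhat @ W_dec.T + l1_coeff / n_tokens
    d_pre = d_f * (pre > 0)
    d_W_enc = x.T @ d_pre
    d_b_enc = d_pre.sum(axis=0)

    # Full-batch gradient descent step.
    W_enc -= lr * d_W_enc
    b_enc -= lr * d_b_enc
    W_dec -= lr * d_W_dec
    b_dec -= lr * d_b_dec

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Each latent's decoder row has length 2 * d_model, so its first half writes to the base model's residual stream and its second half to the chat model's.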
- Wei Shi 19 Nov 2024 2:00 UTC
  1 point
  I got it, thank you very much!