We trained a crosscoder of width 16,384 on the residual stream activations from the middle layer of the Gemma-2 2B base and IT models.
I don’t understand the training process here, nor the one described in Anthropic’s mini-paper. How do you train one crosscoder on the residual streams from two different models?
It’s essentially training an SAE on the concatenation of the residual streams from the base model and the chat model. For each prompt, you run it through the base model to get a residual stream vector v_b, run it through the chat model to get a residual stream vector v_c, concatenate these into a vector twice as long, and train an SAE on that (with some minor additional details that I’m not getting into).
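For concreteness, here is a minimal sketch of that setup in PyTorch. All names here (`Crosscoder`, the loss function, the hyperparameters) are hypothetical illustration, not the actual training code; in particular, the "minor additional details" (e.g. how the sparsity penalty is weighted by decoder norms in Anthropic's formulation) are omitted, and this is just a vanilla L1-penalized SAE over the concatenated vector:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Crosscoder(nn.Module):
    """An SAE over the concatenation of base and chat residual streams.

    d_model is the residual stream width of one model; the encoder's
    input is the 2*d_model concatenation [v_b; v_c].
    """
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(2 * d_model, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, 2 * d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(2 * d_model))

    def forward(self, v_b: torch.Tensor, v_c: torch.Tensor):
        x = torch.cat([v_b, v_c], dim=-1)        # (batch, 2*d_model)
        f = F.relu(x @ self.W_enc + self.b_enc)  # sparse feature activations
        x_hat = f @ self.W_dec + self.b_dec      # joint reconstruction
        return x_hat, x, f

def crosscoder_loss(x_hat, x, f, l1_coeff=1e-3):
    # Reconstruction error on the concatenated vector plus an L1 sparsity
    # penalty on the feature activations, as in a standard SAE.
    recon = (x_hat - x).pow(2).sum(-1).mean()
    sparsity = f.abs().sum(-1).mean()
    return recon + l1_coeff * sparsity

# Hypothetical usage: width 16,384 as in the post; d_model = 2304 for
# Gemma-2 2B. v_b and v_c would come from hooking the middle-layer
# residual stream of the base and IT models on the same prompt.
crosscoder = Crosscoder(d_model=2304, n_features=16384)
v_b = torch.randn(8, 2304)  # stand-in for base-model activations
v_c = torch.randn(8, 2304)  # stand-in for chat-model activations
x_hat, x, f = crosscoder(v_b, v_c)
loss = crosscoder_loss(x_hat, x, f)
```

The point of the joint decoder is that each feature gets a separate decoder direction in each model's half of the output, so you can compare those directions (e.g. their norms) to see whether a feature is base-only, chat-only, or shared.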
I got it, thank you very much!