We trained a crosscoder of width 16,384 on the residual stream activations from the middle layer of the Gemma-2 2B base and IT models.
I don’t understand the training process here, nor the one described in Anthropic’s mini-paper. How do you train one crosscoder on the residual streams from two different models?
It’s essentially training an SAE on the concatenation of the residual streams from the base model and the chat model. For each prompt, you run it through the base model to get a residual stream vector v_b, run it through the chat model to get a residual stream vector v_c, concatenate these into a vector twice as long, and train an SAE on that (with some minor additional details that I’m not getting into).
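For concreteness, here is a minimal sketch of that setup in PyTorch. All names here (`Crosscoder`, the loss function, the hyperparameters) are hypothetical illustration, not the actual training code; in particular, the "minor additional details" (e.g. how the sparsity penalty is weighted by decoder norms in Anthropic's formulation) are omitted, and this is just a vanilla L1-penalized SAE over the concatenated vector:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Crosscoder(nn.Module):
    """An SAE over the concatenation of base and chat residual streams.

    d_model is the residual stream width of one model; the encoder's
    input is the 2*d_model concatenation [v_b; v_c].
    """
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(2 * d_model, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, 2 * d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(2 * d_model))

    def forward(self, v_b: torch.Tensor, v_c: torch.Tensor):
        x = torch.cat([v_b, v_c], dim=-1)        # (batch, 2*d_model)
        f = F.relu(x @ self.W_enc + self.b_enc)  # sparse feature activations
        x_hat = f @ self.W_dec + self.b_dec      # joint reconstruction
        return x_hat, x, f

def crosscoder_loss(x_hat, x, f, l1_coeff=1e-3):
    # Reconstruction error on the concatenated vector plus an L1 sparsity
    # penalty on the feature activations, as in a standard SAE.
    recon = (x_hat - x).pow(2).sum(-1).mean()
    sparsity = f.abs().sum(-1).mean()
    return recon + l1_coeff * sparsity

# Hypothetical usage: width 16,384 as in the post; d_model = 2304 for
# Gemma-2 2B. v_b and v_c would come from hooking the middle-layer
# residual stream of the base and IT models on the same prompt.
crosscoder = Crosscoder(d_model=2304, n_features=16384)
v_b = torch.randn(8, 2304)  # stand-in for base-model activations
v_c = torch.randn(8, 2304)  # stand-in for chat-model activations
x_hat, x, f = crosscoder(v_b, v_c)
loss = crosscoder_loss(x_hat, x, f)
```

The point of the joint decoder is that each feature gets a separate decoder direction in each model's half of the output, so you can compare those directions (e.g. their norms) to see whether a feature is base-only, chat-only, or shared.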
I got it, thank you very much!