… is it possible to train a simple single-layer model to map from residual activations to feature activations? If so, would it be possible to determine whether a given LLM “has” a feature by looking at how well the single-layer model predicts the feature activations?
Obviously if your activations are gemma-2b-pt-layer-6-resid-post and your SAE is also on gemma-2b-pt-layer-6-resid-post, your “simple single-layer model” is going to want to be identical to your SAE encoder. But maybe it would still work for determining which direction most closely maps to the feature’s activation pattern across input sequences, and how well it maps.
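If I’m picturing the “simple single-layer model” right, it’s basically an SAE-encoder-shaped probe trained by regression against precomputed feature activations. A minimal sketch of that setup (the dimensions, the FeatureProbe/train_probe names, and the full-batch MSE training loop are illustrative assumptions on my part, not anything specified above):

```python
import torch
import torch.nn as nn

# Illustrative sizes only: gemma-2b's residual stream is 2048-d; the SAE
# feature count here is just a placeholder.
D_MODEL, N_FEATURES = 2048, 16384

class FeatureProbe(nn.Module):
    """Single linear layer + ReLU, the same shape as an SAE encoder."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.proj = nn.Linear(d_model, n_features)

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(resid))

def train_probe(probe, resid_acts, feature_acts, lr=1e-3, epochs=10):
    """Regress residual activations onto precomputed SAE feature activations.

    resid_acts:   [n_tokens, d_model] residual-stream activations from the
                  model under test (e.g. a finetune of the base model).
    feature_acts: [n_tokens, n_features] feature activations used as labels
                  (e.g. from the base model's SAE on matching inputs).
    """
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(probe(resid_acts), feature_acts)
        loss.backward()
        opt.step()
    return probe
```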
Disclaimer: “can you take an SAE trained on one LLM and determine which of the SAE features exist in a separately-trained LLM which uses the same tokenizer” is an idea I’ve been kicking around and halfheartedly working on for a while, so I may be trying to apply that idea where it doesn’t belong.
Let me make sure I understand your idea correctly:
1. We use a separate single-layer model (analogous to the SAE encoder) to predict the SAE feature activations.
2. We train this model on the SAE activations of the finetuned model (assuming that the SAE wasn’t finetuned on the finetuned model activations?).
3. We then use this model to determine “what direction most closely maps to the activation pattern across input sequences, and how well it maps”.
I’m most unsure about the 2nd step: how we train this feature-activation model. If we train it on the base SAE’s activations computed from the finetuned model, I’m afraid we’ll just be training on extremely noisy data, because those feature activations essentially do not mean the same thing across the two models unless the SAE has been finetuned to appropriately reconstruct the finetuned model’s activations. (And if we finetune it, we might just as well use the SAE and the feature-universality techniques I outlined, without needing a separate model.)
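For the 3rd step (“how well it maps”), one concrete scoring choice would be per-feature explained variance between the probe’s predictions and the labelled feature activations. A sketch under that assumption (the function name and the choice of R² as the score are mine, not something proposed above), with the noisy-labels caveat noted in the docstring:

```python
import torch

def per_feature_fit(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Per-feature R^2 between probe predictions and labelled feature activations.

    pred, target: [n_tokens, n_features]. A feature with R^2 near 1 is well
    predicted from the other model's residual stream; near 0 (or negative)
    suggests no direction reproduces its activation pattern. Caveat from the
    discussion: if `target` comes from running the base SAE on the finetuned
    model's activations, the labels themselves may already be too noisy for
    this score to mean much.
    """
    unexplained = ((target - pred) ** 2).sum(dim=0)
    total = ((target - target.mean(dim=0)) ** 2).sum(dim=0)
    return 1.0 - unexplained / (total + 1e-8)  # R^2 per feature
```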