Let me make sure I understand your idea correctly:
1. We use a separate single-layer model (analogous to the SAE encoder) to predict the SAE feature activations.
2. We train this model on the SAE activations of the finetuned model (assuming that the SAE wasn’t finetuned on the finetuned model’s activations?).
3. We then use this model to determine “what direction most closely maps to the activation pattern across input sequences, and how well it maps”. (A rough sketch of how I’m picturing these steps is below.)
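Concretely, here is a minimal sketch of how I’m picturing steps 1-3, assuming a PyTorch-style setup. Everything here (the sizes, the stand-in activations, the placeholder `base_sae_encoder`) is my own hypothetical scaffolding, not something from your description:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_features = 512, 2048  # assumed sizes, purely for illustration

# Placeholder for the base-model SAE encoder (in practice: the trained SAE's encoder).
base_sae_encoder = nn.Sequential(nn.Linear(d_model, n_features), nn.ReLU())

# Stand-in for residual-stream activations collected from the *finetuned* model.
finetuned_acts = torch.randn(4096, d_model)

# Step 2's targets, as I understand it: the feature activations the base SAE
# assigns to the finetuned model's activations.
with torch.no_grad():
    target_feats = base_sae_encoder(finetuned_acts)

# Step 1: a separate single-layer, encoder-like model trained to predict those targets.
predictor = nn.Sequential(nn.Linear(d_model, n_features), nn.ReLU())
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

for _ in range(200):
    pred = predictor(finetuned_acts)
    loss = nn.functional.mse_loss(pred, target_feats)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Step 3: per-feature fit quality ("how well it maps") and the learned direction
# for each feature ("what direction most closely maps to the activation pattern").
with torch.no_grad():
    pred = predictor(finetuned_acts)
    per_feature_r2 = 1.0 - (target_feats - pred).var(dim=0) / (target_feats.var(dim=0) + 1e-8)
    directions = predictor[0].weight  # [n_features, d_model], one direction per feature
```

The crux is what `target_feats` should be, which is exactly the part I’m unsure about: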
I’m most unsure about the 2nd step: how we train this feature-activation model. If we train it on the base SAE’s activations in the finetuned model, I’m afraid we’ll just be training it on extremely noisy data, because the feature activations essentially do not mean the same thing in the base and finetuned models, unless your SAE has been finetuned to appropriately reconstruct the finetuned model’s activations. (And if we finetune it, we might just as well use the SAE and feature-universality techniques I outlined, without needing a separate model.)
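One way to see whether this worry bites, again with placeholder stand-ins of my own rather than anything from your setup, would be to compare the base SAE’s reconstruction error on base-model vs. finetuned-model activations; a large gap would suggest the feature-activation targets above are indeed noisy:

```python
import torch
import torch.nn as nn

d_model, n_features = 512, 2048
base_acts = torch.randn(2048, d_model)       # stand-in: base-model activations
finetuned_acts = torch.randn(2048, d_model)  # stand-in: finetuned-model activations

# Placeholder for the real base-model SAE (encoder + decoder).
encoder = nn.Sequential(nn.Linear(d_model, n_features), nn.ReLU())
decoder = nn.Linear(n_features, d_model)

def recon_mse(acts: torch.Tensor) -> float:
    # Reconstruction error of the base SAE on the given activations.
    with torch.no_grad():
        return nn.functional.mse_loss(decoder(encoder(acts)), acts).item()

print("base-model recon MSE:     ", recon_mse(base_acts))
print("finetuned-model recon MSE:", recon_mse(finetuned_acts))
```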