Is it possible to determine whether a feature (in the SAE sense of “a single direction in activation space”) exists that produces a given set of changes in output logits?
Let’s say I have a feature from a learned dictionary on some specific layer of some transformer-based LLM. I can run a whole bunch of inputs through the LLM, either adding that feature direction to the activations at that layer (in the manner of Golden Gate Claude) or ablating that direction from the activations at that layer. Either intervention will have some impact on the output logits.
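A minimal sketch of the intervention itself, assuming a HuggingFace GPT-2-style model; `LAYER`, `alpha`, and the random `d` are placeholders for a real SAE feature direction and real hyperparameters:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

LAYER = 6
d = torch.randn(model.config.n_embd)   # stand-in for the dictionary feature
d = d / d.norm()

def steer_hook(module, inputs, output, alpha=5.0):
    hidden = output[0]                  # GPT-2 block output is (hidden_states, ...)
    hidden = hidden + alpha * d         # add the feature ("Golden Gate" style)
    # hidden = hidden - (hidden @ d)[..., None] * d   # or ablate: project it out
    return (hidden,) + output[1:]

ids = tok("The bridge over the bay", return_tensors="pt").input_ids
with torch.no_grad():
    base_logits = model(ids).logits[0, -1]
    handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
    steered_logits = model(ids).logits[0, -1]
    handle.remove()

logit_delta = steered_logits - base_logits   # one (input, logit delta) pair
```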
Now I have a collection of (input token sequence, output logit delta) pairs. Can I, from that set, find by gradient descent a feature direction that approximately reproduces those output logit deltas?
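A hypothetical sketch of that inverse problem: optimize a direction `v` so that injecting it at `LAYER` reproduces the observed deltas. It reuses `model`, `LAYER`, and the hook pattern from the sketch above; `pairs` is an illustrative dataset of (input ids, observed logit delta) tensors, not from any library:

```python
import torch

for p in model.parameters():
    p.requires_grad_(False)             # only v is being learned

v = torch.randn(model.config.n_embd, requires_grad=True)
opt = torch.optim.Adam([v], lr=1e-2)

def inject(direction):
    def hook(module, inputs, output):
        return (output[0] + direction,) + output[1:]
    return hook

for step in range(200):
    loss = torch.tensor(0.0)
    for ids, target_delta in pairs:
        with torch.no_grad():
            base = model(ids).logits[0, -1]
        handle = model.transformer.h[LAYER].register_forward_hook(inject(v))
        steered = model(ids).logits[0, -1]   # gradient flows through v via the hook
        handle.remove()
        loss = loss + ((steered - base) - target_delta).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```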
If yes, could the same method be used to determine which features in a learned dictionary trained on one LLM exist in a completely different LLM that uses the same tokenizer?
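If it works, the cross-model version would be the same loop with the pieces swapped: measure the deltas on model A, but optimize the direction inside model B. A hypothetical outline, where both helpers are just names for the two sketches above wrapped as functions, not real APIs:

```python
pairs = collect_logit_deltas(model_a, layer_a, feature_dir)  # deltas from model A
v_b = recover_direction(model_b, layer_b, pairs)             # direction in model B
# If the loss goes to ~0, model B plausibly has "the same" feature.
```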
I imagine someone has already investigated this question, but I’m not sure what search terms to use to find it. The obvious search terms like “sparse autoencoder cross model” or “Cross-model feature alignment in transformers” don’t turn up a ton, although they turn up the somewhat relevant paper “Text-To-Concept (and Back) via Cross-Model Alignment”.
Wait, I think I am overthinking this by a lot, and the thing I want is in the literature under terms like “classifier” and “linear regression”.
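Concretely: to first order, the logit delta caused by injecting `v` at a layer is linear in `v` (delta ≈ J·v, where J is the Jacobian of the logits with respect to that layer’s activations), so recovering the direction reduces to an ordinary least-squares problem rather than a full gradient-descent loop. A sketch, assuming the per-input Jacobians have been stacked into `J` and the observed deltas into `y` (both names illustrative):

```python
# J: (n_inputs * vocab, d_model) stacked Jacobians; y: (n_inputs * vocab,) deltas
import torch
v_hat = torch.linalg.lstsq(J, y).solution
```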