Implicitly, SAEs are trying to model activations across a context window, shaped like (n_ctx, n_emb). But today’s SAEs ignore the token axis, modeling such activations as lists of n_ctx IID samples from a distribution over n_emb-dim vectors.
I suspect SAEs could be much better (at reconstruction, sparsity and feature interpretability) if they didn’t make this modeling choice—for instance by “looking back” at earlier tokens using causal attention.
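For concreteness, here is a minimal sketch (in PyTorch; the dimension names and the loss coefficient are illustrative assumptions, not anything from a specific SAE paper) of the standard setup. The window axis is simply folded into the batch axis before the autoencoder ever sees it:

```python
import torch
import torch.nn as nn

class VanillaSAE(nn.Module):
    # A standard sparse autoencoder: it only ever sees single n_emb-dim vectors.
    def __init__(self, n_emb: int, n_feat: int):
        super().__init__()
        self.enc = nn.Linear(n_emb, n_feat)
        self.dec = nn.Linear(n_feat, n_emb)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(f), f

# acts: LLM activations for a batch of windows, shape (batch, n_ctx, n_emb)
batch, n_ctx, n_emb, n_feat = 8, 128, 512, 4096
acts = torch.randn(batch, n_ctx, n_emb)

sae = VanillaSAE(n_emb, n_feat)
# The token axis is folded into the batch axis: every position is treated as an
# independent sample from a distribution over n_emb-dim vectors.
flat = acts.reshape(-1, n_emb)
recon, feats = sae(flat)
l1 = feats.abs().sum(-1).mean()          # sparsity penalty, applied per position
mse = (recon - flat).pow(2).mean()
loss = mse + 3e-4 * l1                   # 3e-4 is an arbitrary illustrative coefficient
```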
Assorted arguments to this effect:
There is probably a lot of compressible redundancy in LLM activations across the token axis, because most ways of combining textual features (given any intuitive notion of “textual feature”) across successive tokens are either very unlikely under the data distribution, or actually impossible.
For example, you’re never going to see a feature that means “we’re in the middle of a lengthy monolingual English passage” at position j and then a feature that means “we’re in the middle of a lengthy monolingual Chinese passage” at position j+1.
In other words, the intrinsic dimensionality of window activations is probably a lot less than n_ctx * n_emb, but SAEs don’t exploit this.
[Another phrasing of the previous point.] Imagine that someone handed you a dataset of (n_ctx, n_emb)-shaped matrices, without telling you where they’re from. (But in fact they’re LLM activations, the same ones we train SAEs on.) And they said “OK, your task is to autoencode these things.”
I think you would quickly notice, just by doing exploratory data analysis, that the data has a ton of structure along the first axis: the rows of each matrix are very much not independent. And you’d design your autoencoder around this fact.
Now someone else comes to you and says “hey look at my cool autoencoder for this problem. It’s actually just an autoencoder for individual rows, and then I autoencode a matrix by applying it separately to the rows one by one.”
This would seem bizarre—you’d want to ask this person what the heck they were thinking.
But this is what today’s SAEs do.
We want features that “make sense,” “are interpretable.” In general, such features will be properties of regions of text (phrases, sentences, passages, or the whole text at once) rather than individual tokens.
Intuitively, such a feature is equally present at every position within the region. An SAE has to pay a high L1 cost to activate the feature over and over at all those positions.
This could lead to an unnatural bias to capture features that are relatively localized, and not capture those that are less localized.
Or, less-localized features might be captured but with “spurious localization”:
Conceptually, the feature is equally “true” of the whole region at once.
At some positions in the region, the balance between L1/reconstruction tips in favor of reconstruction, so the feature is active.
At other positions, the balance tips in favor of L1, and the feature is turned off.
To the interpreter, this looks like a feature that has a clear meaning at the whole-region level, yet flips on and off in a confusing and seemingly arbitrary pattern within the region.
The “spurious localization” story feels like a plausible explanation for the way current SAE features look.
Often if you look at the most-activating cases, there is some obvious property shared by the entire texts you are seeing, but the pattern of feature activation within each text is totally mysterious. Many of the features in the Anthropic Sonnet paper look like this to me.
Descriptions of SAE features typically round this off to a nice-sounding description at the whole-text level, ignoring the uninterpretable pattern over time. You’re being sold an “AI agency feature” (or whatever), but what you actually get is a “feature that activates at seemingly random positions in AI-agency-related texts.”
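To put rough numbers on the L1 argument above (my numbers, purely illustrative): a region-level feature that stays on across a 100-token region pays 100x the sparsity penalty that a single “flag” activation would, so the optimizer has a strong incentive to drop it at some positions.

```python
# Toy arithmetic (illustrative numbers, not from the post): L1 cost of a
# region-level feature under today's position-wise SAEs, vs. if it could fire once.
l1_coeff = 3e-4       # hypothetical sparsity coefficient
activation = 5.0      # hypothetical feature activation magnitude
region_len = 100      # number of tokens over which the property holds

cost_every_position = l1_coeff * activation * region_len  # today's SAEs
cost_once = l1_coeff * activation                          # a "fire once" SAE
print(cost_every_position / cost_once)  # 100.0 -- in practice the optimizer often
# splits the difference, keeping the feature on at some positions and off at
# others: the "spurious localization" pattern described above.
```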
An SAE that could “look back” at earlier positions might be able to avoid paying “more than one token’s worth of L1” for a region-level feature, and this might have a very nice interpretation as “diffing” the text.
I’m imagining that a very non-localized (intuitive) feature, such as “this text is in English,” would be active just once at the first position where the property it’s about becomes clearly true.
Ideally, at later positions, the SAE encoder would look back at earlier activations and suppress this feature here because it’s “already been accounted for,” thus saving some L1.
And the decoder would also look back (possibly reusing the same attention output or something) and treat the feature as though it had been active here (in the sense that “active here” would mean in today’s SAEs), thus preserving reconstruction quality.
In this contextual SAE, the features now express only what is new or unpredictable-in-advance at the current position: a “diff” relative to all the features at earlier positions.
For example, if the language switches from English to Cantonese, we’d have one or more feature activations that “turn off” English and “turn on” Cantonese, at the position where the switch first becomes evident.
But within the contiguous, monolingual regions, the language would be implicit in the features at the most recent position where such a “language flag” was set. All the language-flag-setting features would be free to turn off inside these regions, freeing up L0/L1 room for stuff we don’t already know about the text.
This seems like it would allow for vastly higher sparsity at any given level of reconstruction quality—and also better interpretability at any given level of sparsity, because we don’t have the “spurious localization” problem.
(I don’t have any specific architecture for this in mind, though I’ve gestured towards one above. It’s of course possible that this might just not work, or would be very tricky. One danger is that the added expressivity might make your autoencoder “too powerful,” with opaque/polysemantic/etc. calculations chained across positions that recapitulate in miniature the original problem of interpreting models with attention; it may or may not be tough to avoid this in practice.)
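To gesture a bit more concretely at what one instantiation could look like, here is a sketch of my own (not a tested design, and all module names, head counts, and the shift-by-one trick are my assumptions): both the encoder and the decoder get a single causal-attention “look back” over strictly earlier positions.

```python
import torch
import torch.nn as nn

def shift_right(t: torch.Tensor) -> torch.Tensor:
    # Prepend a zero vector along the token axis, so that causal attention over
    # the shifted sequence only sees strictly earlier positions.
    return torch.cat([torch.zeros_like(t[:, :1]), t[:, :-1]], dim=1)

class ContextualSAE(nn.Module):
    """Hypothetical "looking back" SAE (a sketch under stated assumptions).

    The encoder subtracts a causal-attention summary of earlier activations
    before encoding, so features only need to express what is new at this
    position. The decoder adds back a causal-attention summary of earlier
    reconstructions, so a region-level feature set once keeps contributing later.
    """

    def __init__(self, n_emb: int, n_feat: int, n_head: int = 4):
        super().__init__()
        self.enc = nn.Linear(n_emb, n_feat)
        self.dec = nn.Linear(n_feat, n_emb)
        self.look_back_in = nn.MultiheadAttention(n_emb, n_head, batch_first=True)
        self.look_back_out = nn.MultiheadAttention(n_emb, n_head, batch_first=True)

    def forward(self, x: torch.Tensor):          # x: (batch, n_ctx, n_emb)
        n_ctx = x.shape[1]
        # Standard causal mask; combined with shift_right it means "strictly earlier".
        mask = torch.full((n_ctx, n_ctx), float("-inf"), device=x.device).triu(1)

        # Encoder: encode only the "diff" relative to what earlier positions explain.
        prev = shift_right(x)
        explained, _ = self.look_back_in(x, prev, prev, attn_mask=mask)
        f = torch.relu(self.enc(x - explained))   # sparse features: what's new here

        # Decoder: local reconstruction plus carried-over contributions from
        # earlier positions' reconstructions.
        local = self.dec(f)
        prev_local = shift_right(local)
        carried, _ = self.look_back_out(local, prev_local, prev_local, attn_mask=mask)
        return local + carried, f
```

The L1 penalty would be applied to f exactly as in a standard SAE; whether this particular way of adding attention avoids the “too powerful” failure mode mentioned above is an open question.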
At any point before the last attention layer, LLM activations at individual positions are free to be “ambiguous” when taken out of context, in the sense that the same vector might mean different things in two different windows. The LLM can always disambiguate them as needed with attention, later.
This is meant as a counter to the following argument: “activations at individual positions are the right unit of analysis, because they are what the LLM internals can directly access (read off linearly, etc.)”
If we’re using a notion of “what the LLM can directly access” which (taken literally) implies that it can’t “directly access” other positions, that seems way too limiting—we’re effectively pretending the LLM is a bigram model.
Yeah, makes me think about trying to have ‘rotation and translation invariant’ representations of objects in ML vision research.
Seems like if you can subtract out general, longer span terms (but note their presence for the decoder to add them back in), that would be much more intuitive. Language, as you mentioned, is an obvious one. Some others which occur to me are:
Whether the text is being spoken by a character in a dialogue (vs the ‘AI assistant’ character).
Whether the text is near the beginning/middle/end of a passage.
Patterns of speech being used in this particular passage of text (e.g. weird punctuation / capitalization patterns).