SAEs are almost the opposite of the principle John is advocating for here. They deliver sparsity in the sense that the dictionary you get only has a few neurons not be in the zero state at the same time, they do not deliver sparsity in the sense of a low dimensional summary of the relevant information in the layer, or whatever other causal cut you deploy them on. Instead, the dimensionality of the representation gets blown up to be even larger.
My understanding was that John wanted to only have a few variables mattering on a given input, which SAEs give you. The causal graph is large in general, but IMO that’s just an unavoidable property of models and superposition.
I’m confused by why you don’t consider “only a few neurons being non-zero” to be a “low dimensional summary of the relevant information in the layer”
The causal graph is large in general, but IMO that’s just an unavoidable property of models and superposition.
This is a discussion that would need to be its own post, but I think superposition is basically not real and a confused concept.
Leaving that aside, the vanilla reading of this claim also seems kind of obviously false for many models, otherwise optimising them in inference through e.g. low rank approximation of weight matrices would never work. You are throwing away at least one floating point number worth of description bits there.
I’m confused by why you don’t consider “only a few neurons being non-zero” to be a “low dimensional summary of the relevant information in the layer”
A low-dimensional summary of a variable vector f of size n is a fixed set of random variables d<n that suffice to summarise the state of f. To summarise the state of f using the activations in an SAE dictionary, I have to describe the state of more than n variables. That these variables are sparse may sometimes let me define an encoding scheme for describing them that takes less than d<n variables, but that just corresponds to undoing the autoencoding and then performing some other compression.
This is a discussion that would need to be its own post, but I think superposition is basically not real and a confused concept.
I’d be curious to hear more about this—IMO we’re talking past each other given that we disagree on this point! Like, in my opinion, the reason low rank approximations work at all is because of superposition.
For example, if an SAE gives us 16x as many dimensions as the original activations, and we find that half of those are interpretable, to me this seems like clear evidence of superposition (8x as many interpretable directions!). How would you interpret that phenomena?
For example, if an SAE gives us 16x as many dimensions as the original activations, and we find that half of those are interpretable, to me this seems like clear evidence of superposition (8x as many interpretable directions!). How would you interpret that phenomena?
I don’t have the time and energy to do this properly right now, but here’s a few thought experiments to maybe help communicate part of what I mean:
Say you have a transformer model that draws animals. As in, you type “draw me a giraffe”, and then it draws you a giraffe. Unknown to you, the way the model algorithm works is that the first thirty layers of the model perform language processing to figure out what you want drawn, and output a summary of fifty scalar variables that the algorithms in the next thirty layers of the model use to draw the animals. And these fifty variables are things like “furriness”, “size”, “length of tail” and so on.
The latter half of the model does then not, in any real sense, think of the concept “giraffe” while it draws the giraffe. It is just executing purely geometric algorithms that use these fifty variables to figure out what shapes to draw.
If you then point a sparse autoencoder at the residual stream in the latter half of the model, over a data set of people asking the network to draw lots of different animals, far more than fifty or the network width, I’d guess the “sparse features” the SAE finds might be the individual animal types. “Giraffe”, “elephant”, etc. .
Or, if you make the encoder dictionary larger, more specific sparse features like “fat giraffe” would start showing up.
And then, some people may conclude that the model was doing a galaxy-brained thing where it was thinking about all of these animals using very little space, compressing a much larger network in which all these animals are variables. This is kind of true in a certain sense if you squint, but pretty misleading. The model at this point in the computation no longer “knows” what a giraffe is. It just “knows” what the settings of furriness, tail length, etc. are right now. If you manually go into the network and set the fifty variables to something that should correspond to a unicorn, the network will draw you a unicorn, even if there were no unicorns in the training data and the first thirty layers in the network don’t know how to set the fifty variables to draw one. So in a sense, this algorithm is more general than a cleverly compressed lookup table of animals would be. And if you want to learn how the geometric algorithms that do the drawing work, what they do with the fifty scalar summary statistics is what you will need to look at.
Just because we can find a transformation that turns an NNs activations into numbers that correlate with what a human observer would regard as separate features of the data, does not mean the model itself is treating these as elementary variables in its own computations in any meaningful sense.
The only thing the SAE is showing you is that the information present in the model can be written as a sum of some sparsely activating generators of the data. This does not mean that the model is processing the problem in terms of these variables. Indeed, SAE dictionaries are almost custom-selected not to give you variables that a well-generalizing algorithm would use to think about problems with big, complicated state spaces. Good summary variables are highly compositional, not sparse. They can all be active at the same time in any setting, letting you represent the relevant information from a large state space with just a few variables, because they factorise. Temperature and volume are often good summary variables for thinking about thermodynamic systems because the former tells you nothing about the latter and they can co-occur in any combination of values. Variables with strong sparsity conditions on them instead have high mutual information, making them partially redundant, and ripe for compressing away into summary statistics.
If an NN (artificial or otherwise) is, say, processing images coming in from the world, it is dealing with an exponentially large state space. Every pixel can take one of several values. Luckily, the probability distribution of pixels is extremely peaked. The supermajority of pixel settings are TV static that never occurs, and thermal noise that doesn’t matter for the NNs task. One way to talk about this highly peaked pixel distribution may be to describe it as a sum of a very large number of sparse generators. The model then reasons about this distribution by compressing the many sparse generators into a small set ofpretty non-sparse, highly compositional variables. For example, many images contain one or a few brown branchy structures of a certain kind, which come in myriad variations. The model summarises the presence or absence of any of these many sparse generators with the state of the variable “tree”, which tracks how much the input is “like a tree”.
If the model has a variable “tree” and a variable “size”, the myriad brown, branchy structures in the data might, for example, show up as sparsely encoded vectors in a two-dimensional (“tree”,“size”) manifold. If you point a SAE at that manifold, you may get out sparse activations like “bush” (mid tree, low size) “house” (low tree, high size), “fir” (high tree, high size). If you increase the dictionary size, you might start getting more fine-grained sparse data generators. E.g. “Checkerberry bush” and “Honeyberry bush” might show up as separate, because they have different sizes.
Humans, I expect, work similarly. So the human-like abstractions the model may or may not be thinking in and that we are searching for will not come in the form of sparse generators of layer activations, because human abstractions are the summary variables you would be using to compress these sparse generators. They are the type-of-thing you use to encodea sparse world, not the type-of-thing being encoded. That our SAE is showing us some activations that correlate with information in the input humans regard as meaningful just tells us that the data contains sparse generators humans have conceptual descriptions for, not that the algorithms of the network themselves are encoding the sparse generators using these same human conceptual descriptions. We know it hasn’t thrown away the information needed to compute that there was a bush in the image, but we don’t know it is thinking in bush. It probably isn’t, else bush would not be sparse with respect to the other summary statistics in the layer, and our SAE wouldn’t have found it.
This is a great, thought-provoking critique of SAEs.
That said, I think SAEs make more sense if we’re trying to explain an LLM (or any generative model of messy real-world data) than they do if we’re trying to explain the animal-drawing NN.
In the animal-drawing example:
There’s only one thing the NN does.
It’s always doing that thing, for every input.
The thing is simple enough that, at a particular point in the NN, you can write out all the variables the NN cares about in a fully compositional code and still use fewer coordinates (50) than the dictionary size of any reasonable SAE.
With something like an LLM, we expect the situation to be more like:
The NN can do a huge number of “things” or “tasks.” (Equivalently, it can model many different parts of the data manifold with different structures.)
For any given input, it’s only doing roughly one of these “tasks.”
If you try to write out a fully compositional code for each task—akin to the size / furriness / etc. code, but we have a separate one for every task—and then take the Cartesian product of them all to get a giant compositional code for everything at once, this code would have a vast number of coordinates. Much larger than the activation vectors we’d be explaining with an SAE, and also much larger than the dictionary of that SAE.
The aforementioned code would also be super wasteful, because it uses most of its capacity expressing states where multiple tasks compose in an impossible or nonsensical fashion. (Like “The height of the animal currently being drawn is X, AND the current Latin sentence is in the subjunctive mood, AND we are partway through a Rust match expression, AND this author of this op-ed is very right-wing.”)
The NN doesn’t have enough coordinates to express this Cartesian product code, but it also doesn’t need to do so, because the code is wasteful. Instead, it expresses things in a way that’s less-than-fully-compositional (“superposed”) across tasks, no matter how compositional it is within tasks.
Even if every task is represented in a maximally compositional way, the per-task coordinates are still sparse, because we’re only doing ~1 task at once and there are many tasks. The compositional nature of the per-task features doesn’t prohibit them from being sparse, because tasks are sparse.
The reason we’re turning to SAEs is that the NN doesn’t have enough capacity to write out the giant Cartesian product code, so instead it leverages the fact that tasks are sparse, and “re-uses” the same activation coordinates to express different things in different task-contexts.
If this weren’t the case, interpretability would be much simpler: we’d just hunt for a transformation that extracts the Cartesian product code from the NN activations, and then we’re done.
If it existed, this transformation would probably (?) be linear, b/c the information needs to be linearly retrievable within the NN; something in the animal-painter that cares about height needs to be able to look at the height variable, and ideally to do so without wasting a nonlinearity on reconstructing it.
Our goal in using the SAE is not to explain everything in a maximally sparse way; it’s to factor the problem into (sparse tasks) x (possibly dense within-task codes).
Why might that happen in practice? If we fit an SAE to the NN activations on the full data distribution, covering all the tasks, then there are two competing pressures:
On the one hand, the sparsity loss term discourages the SAE from representing any given task in a compositional way, even if the NN does so. All else being equal, this is indeed bad.
On the other hand, the finite dictionary size discourages the SAE from expanding the number of coordinates per task indefinitely, since all the other tasks have to fit somewhere too.
In other words, if your animal-drawing case is one the many tasks, and the SAE is choosing whether to represent it as 50 features that all fire together or 1000 one-hot highly-specific-animal features, it may prefer the former because it doesn’t have room in its dictionary to give every task 1000 features.
This tension only appears when there are multiple tasks. If you just have one compositionally-represented task and a big dictionary, the SAE does behave pathologically as you describe.
But this case is different from the ones that motivate SAEs: there isn’t actually any sparsity in the underlying problem at all!
Whereas with LLMs, we can be pretty sure (I would think?) that there’s extreme sparsity in the underlying problem, due to dimension-counting arguments, intuitions about the number of “tasks” in natural data and their level of overlap, observed behaviors where LLMs represent things that are irrelevant to the vast majority of inputs (like retrieving very obscure facts), etc.
The way I would phrase this concern is “SAEs might learn to pick up on structure present in the underlying data, rather than to pick up on learned structure in NN activations.” E.g. since “tree” is a class of things defined by a bunch of correlations present in the underlying image data, it’s possible that images of trees will naturally cluster in NN activations even when the NN has no underlying tree concept; SAEs would still be able to detect and learn this cluster as one of their neurons.
I agree this is a valid critique. Here’s one empirical test which partially gets at it: what happens when you train an SAE on a NN with random weights? (I.e. you randomize the parameters of your NN, and then train an SAE on its activations on real data in the normal way.) Then to the extent that your SAE has good-looking features, that must be because your SAE was picking up on structure in the underlying data.
My collaborators and I did this experiment. In more detail, we trained SAEs on Pythia-70m’s MLPs, then did this again but after randomizing the weights of Pythia-70m. Take a moment to predict the results if you want etc etc.
The SAEs that we trained on a random network looked bad. The most interesting dictionary features we found were features that activated on particular tokens (e.g. features that activated on the “man” feature and no others). Most of the features didn’t look like anything at all, activating on a large fraction (>10%) of tokens in our data, with no obvious patterns.(The features for dictionaries trained on the non-random network looked much better.)
We also did a variant of this experiment where use randomized Pythia-70m’s parameters except for the embedding layer. In this variant, the most interesting features we found were features which fired on a few closely semantically related tokens (e.g. the tokens “make,” “makes,” and “making”).
Thanks to my collaborators for this experiment: Aaron Mueller and David Bau.
I agree that a reasonable intuition for what SAEs do is: identify “basic clusters” in NN activations (basic in the sense that you allow compositionality, i.e. you don’t try to learn clusters whose centroids are the sums of the centroids of previously-learned clusters). And these clusters might exist because:
your NN has learned concepts and these clusters correspond to concepts (what we hope is the reason), or
because of correlations present in your underlying data (the thing that you seem to be worried about).
Beyond the preliminary empirics I mentioned above, I think there are some theoretical reasons to hope that SAEs will mostly learn the first type of cluster:
Most clusters in NN activations on real data might be of the first type
This is because the NN has already, during training, noticed various correlations in the data and formed concepts around them (to the extent that these concepts were useful for getting low loss, which they typically will be if your model is trained on next-token prediction (a task which incentivizes you to model all the correlations)).
Clusters of the second type might not have any interesting compositional structure, but your SAE gets bonus points for learning clusters which participate in compositional structure.
E.g. If there are five clusters with centroids w, x, y, z, and y + z and your SAE can only learn 2 of them, then it would prefer to learn the clusters with centroids y and z (because then it can model the cluster with centroid y + z for free).
(The results for correlations from auto-interp are less clear: they find similar correlation coefficients with and without weight randomization. However, they find that this might be due to single token features on the part of the randomized transformer and when you ignore these features (or correct in some other way I’m forgetting?), the SAE on an actual transformer indeed has higher correlation.)
Another metric is: comparing the similarity between two dictionaries using mean max cosine similarity (where one of the dictionaries is treated as the ground truth), we’ve found that two dictionaries trained from different random seeds on the same (non-randomized) model are highly similar (>.95), whereas dictionaries trained on a randomized model and an non-randomized model are dissimilar (<.3 IIRC, but I don’t have the data on hand).
SAEs are almost the opposite of the principle John is advocating for here. They deliver sparsity in the sense that the dictionary you get only has a few neurons not be in the zero state at the same time, they do not deliver sparsity in the sense of a low dimensional summary of the relevant information in the layer, or whatever other causal cut you deploy them on. Instead, the dimensionality of the representation gets blown up to be even larger.
My understanding was that John wanted to only have a few variables mattering on a given input, which SAEs give you. The causal graph is large in general, but IMO that’s just an unavoidable property of models and superposition.
I’m confused by why you don’t consider “only a few neurons being non-zero” to be a “low dimensional summary of the relevant information in the layer”
This is a discussion that would need to be its own post, but I think superposition is basically not real and a confused concept.
Leaving that aside, the vanilla reading of this claim also seems kind of obviously false for many models, otherwise optimising them in inference through e.g. low rank approximation of weight matrices would never work. You are throwing away at least one floating point number worth of description bits there.
A low-dimensional summary of a variable vector f of size n is a fixed set of random variables d<n that suffice to summarise the state of f. To summarise the state of f using the activations in an SAE dictionary, I have to describe the state of more than n variables. That these variables are sparse may sometimes let me define an encoding scheme for describing them that takes less than d<n variables, but that just corresponds to undoing the autoencoding and then performing some other compression.
I’d be curious to hear more about this—IMO we’re talking past each other given that we disagree on this point! Like, in my opinion, the reason low rank approximations work at all is because of superposition.
For example, if an SAE gives us 16x as many dimensions as the original activations, and we find that half of those are interpretable, to me this seems like clear evidence of superposition (8x as many interpretable directions!). How would you interpret that phenomena?
I don’t have the time and energy to do this properly right now, but here’s a few thought experiments to maybe help communicate part of what I mean:
Say you have a transformer model that draws animals. As in, you type “draw me a giraffe”, and then it draws you a giraffe. Unknown to you, the way the model algorithm works is that the first thirty layers of the model perform language processing to figure out what you want drawn, and output a summary of fifty scalar variables that the algorithms in the next thirty layers of the model use to draw the animals. And these fifty variables are things like “furriness”, “size”, “length of tail” and so on.
The latter half of the model does then not, in any real sense, think of the concept “giraffe” while it draws the giraffe. It is just executing purely geometric algorithms that use these fifty variables to figure out what shapes to draw.
If you then point a sparse autoencoder at the residual stream in the latter half of the model, over a data set of people asking the network to draw lots of different animals, far more than fifty or the network width, I’d guess the “sparse features” the SAE finds might be the individual animal types. “Giraffe”, “elephant”, etc. .
Or, if you make the encoder dictionary larger, more specific sparse features like “fat giraffe” would start showing up.
And then, some people may conclude that the model was doing a galaxy-brained thing where it was thinking about all of these animals using very little space, compressing a much larger network in which all these animals are variables. This is kind of true in a certain sense if you squint, but pretty misleading. The model at this point in the computation no longer “knows” what a giraffe is. It just “knows” what the settings of furriness, tail length, etc. are right now. If you manually go into the network and set the fifty variables to something that should correspond to a unicorn, the network will draw you a unicorn, even if there were no unicorns in the training data and the first thirty layers in the network don’t know how to set the fifty variables to draw one. So in a sense, this algorithm is more general than a cleverly compressed lookup table of animals would be. And if you want to learn how the geometric algorithms that do the drawing work, what they do with the fifty scalar summary statistics is what you will need to look at.
Just because we can find a transformation that turns an NNs activations into numbers that correlate with what a human observer would regard as separate features of the data, does not mean the model itself is treating these as elementary variables in its own computations in any meaningful sense.
The only thing the SAE is showing you is that the information present in the model can be written as a sum of some sparsely activating generators of the data. This does not mean that the model is processing the problem in terms of these variables. Indeed, SAE dictionaries are almost custom-selected not to give you variables that a well-generalizing algorithm would use to think about problems with big, complicated state spaces. Good summary variables are highly compositional, not sparse. They can all be active at the same time in any setting, letting you represent the relevant information from a large state space with just a few variables, because they factorise. Temperature and volume are often good summary variables for thinking about thermodynamic systems because the former tells you nothing about the latter and they can co-occur in any combination of values. Variables with strong sparsity conditions on them instead have high mutual information, making them partially redundant, and ripe for compressing away into summary statistics.
If an NN (artificial or otherwise) is, say, processing images coming in from the world, it is dealing with an exponentially large state space. Every pixel can take one of several values. Luckily, the probability distribution of pixels is extremely peaked. The supermajority of pixel settings are TV static that never occurs, and thermal noise that doesn’t matter for the NNs task. One way to talk about this highly peaked pixel distribution may be to describe it as a sum of a very large number of sparse generators. The model then reasons about this distribution by compressing the many sparse generators into a small set of pretty non-sparse, highly compositional variables. For example, many images contain one or a few brown branchy structures of a certain kind, which come in myriad variations. The model summarises the presence or absence of any of these many sparse generators with the state of the variable “tree”, which tracks how much the input is “like a tree”.
If the model has a variable “tree” and a variable “size”, the myriad brown, branchy structures in the data might, for example, show up as sparsely encoded vectors in a two-dimensional (“tree”,“size”) manifold. If you point a SAE at that manifold, you may get out sparse activations like “bush” (mid tree, low size) “house” (low tree, high size), “fir” (high tree, high size). If you increase the dictionary size, you might start getting more fine-grained sparse data generators. E.g. “Checkerberry bush” and “Honeyberry bush” might show up as separate, because they have different sizes.
Humans, I expect, work similarly. So the human-like abstractions the model may or may not be thinking in and that we are searching for will not come in the form of sparse generators of layer activations, because human abstractions are the summary variables you would be using to compress these sparse generators. They are the type-of-thing you use to encode a sparse world, not the type-of-thing being encoded. That our SAE is showing us some activations that correlate with information in the input humans regard as meaningful just tells us that the data contains sparse generators humans have conceptual descriptions for, not that the algorithms of the network themselves are encoding the sparse generators using these same human conceptual descriptions. We know it hasn’t thrown away the information needed to compute that there was a bush in the image, but we don’t know it is thinking in bush. It probably isn’t, else bush would not be sparse with respect to the other summary statistics in the layer, and our SAE wouldn’t have found it.
This is a great, thought-provoking critique of SAEs.
That said, I think SAEs make more sense if we’re trying to explain an LLM (or any generative model of messy real-world data) than they do if we’re trying to explain the animal-drawing NN.
In the animal-drawing example:
There’s only one thing the NN does.
It’s always doing that thing, for every input.
The thing is simple enough that, at a particular point in the NN, you can write out all the variables the NN cares about in a fully compositional code and still use fewer coordinates (50) than the dictionary size of any reasonable SAE.
With something like an LLM, we expect the situation to be more like:
The NN can do a huge number of “things” or “tasks.” (Equivalently, it can model many different parts of the data manifold with different structures.)
For any given input, it’s only doing roughly one of these “tasks.”
If you try to write out a fully compositional code for each task—akin to the size / furriness / etc. code, but we have a separate one for every task—and then take the Cartesian product of them all to get a giant compositional code for everything at once, this code would have a vast number of coordinates. Much larger than the activation vectors we’d be explaining with an SAE, and also much larger than the dictionary of that SAE.
The aforementioned code would also be super wasteful, because it uses most of its capacity expressing states where multiple tasks compose in an impossible or nonsensical fashion. (Like “The height of the animal currently being drawn is X, AND the current Latin sentence is in the subjunctive mood, AND we are partway through a Rust match expression, AND this author of this op-ed is very right-wing.”)
The NN doesn’t have enough coordinates to express this Cartesian product code, but it also doesn’t need to do so, because the code is wasteful. Instead, it expresses things in a way that’s less-than-fully-compositional (“superposed”) across tasks, no matter how compositional it is within tasks.
Even if every task is represented in a maximally compositional way, the per-task coordinates are still sparse, because we’re only doing ~1 task at once and there are many tasks. The compositional nature of the per-task features doesn’t prohibit them from being sparse, because tasks are sparse.
The reason we’re turning to SAEs is that the NN doesn’t have enough capacity to write out the giant Cartesian product code, so instead it leverages the fact that tasks are sparse, and “re-uses” the same activation coordinates to express different things in different task-contexts.
If this weren’t the case, interpretability would be much simpler: we’d just hunt for a transformation that extracts the Cartesian product code from the NN activations, and then we’re done.
If it existed, this transformation would probably (?) be linear, b/c the information needs to be linearly retrievable within the NN; something in the animal-painter that cares about height needs to be able to look at the height variable, and ideally to do so without wasting a nonlinearity on reconstructing it.
Our goal in using the SAE is not to explain everything in a maximally sparse way; it’s to factor the problem into (sparse tasks) x (possibly dense within-task codes).
Why might that happen in practice? If we fit an SAE to the NN activations on the full data distribution, covering all the tasks, then there are two competing pressures:
On the one hand, the sparsity loss term discourages the SAE from representing any given task in a compositional way, even if the NN does so. All else being equal, this is indeed bad.
On the other hand, the finite dictionary size discourages the SAE from expanding the number of coordinates per task indefinitely, since all the other tasks have to fit somewhere too.
In other words, if your animal-drawing case is one the many tasks, and the SAE is choosing whether to represent it as 50 features that all fire together or 1000 one-hot highly-specific-animal features, it may prefer the former because it doesn’t have room in its dictionary to give every task 1000 features.
This tension only appears when there are multiple tasks. If you just have one compositionally-represented task and a big dictionary, the SAE does behave pathologically as you describe.
But this case is different from the ones that motivate SAEs: there isn’t actually any sparsity in the underlying problem at all!
Whereas with LLMs, we can be pretty sure (I would think?) that there’s extreme sparsity in the underlying problem, due to dimension-counting arguments, intuitions about the number of “tasks” in natural data and their level of overlap, observed behaviors where LLMs represent things that are irrelevant to the vast majority of inputs (like retrieving very obscure facts), etc.
The way I would phrase this concern is “SAEs might learn to pick up on structure present in the underlying data, rather than to pick up on learned structure in NN activations.” E.g. since “tree” is a class of things defined by a bunch of correlations present in the underlying image data, it’s possible that images of trees will naturally cluster in NN activations even when the NN has no underlying tree concept; SAEs would still be able to detect and learn this cluster as one of their neurons.
I agree this is a valid critique. Here’s one empirical test which partially gets at it: what happens when you train an SAE on a NN with random weights? (I.e. you randomize the parameters of your NN, and then train an SAE on its activations on real data in the normal way.) Then to the extent that your SAE has good-looking features, that must be because your SAE was picking up on structure in the underlying data.
My collaborators and I did this experiment. In more detail, we trained SAEs on Pythia-70m’s MLPs, then did this again but after randomizing the weights of Pythia-70m. Take a moment to predict the results if you want etc etc.
The SAEs that we trained on a random network looked bad. The most interesting dictionary features we found were features that activated on particular tokens (e.g. features that activated on the “man” feature and no others). Most of the features didn’t look like anything at all, activating on a large fraction (>10%) of tokens in our data, with no obvious patterns.(The features for dictionaries trained on the non-random network looked much better.)
We also did a variant of this experiment where use randomized Pythia-70m’s parameters except for the embedding layer. In this variant, the most interesting features we found were features which fired on a few closely semantically related tokens (e.g. the tokens “make,” “makes,” and “making”).
Thanks to my collaborators for this experiment: Aaron Mueller and David Bau.
I agree that a reasonable intuition for what SAEs do is: identify “basic clusters” in NN activations (basic in the sense that you allow compositionality, i.e. you don’t try to learn clusters whose centroids are the sums of the centroids of previously-learned clusters). And these clusters might exist because:
your NN has learned concepts and these clusters correspond to concepts (what we hope is the reason), or
because of correlations present in your underlying data (the thing that you seem to be worried about).
Beyond the preliminary empirics I mentioned above, I think there are some theoretical reasons to hope that SAEs will mostly learn the first type of cluster:
Most clusters in NN activations on real data might be of the first type
This is because the NN has already, during training, noticed various correlations in the data and formed concepts around them (to the extent that these concepts were useful for getting low loss, which they typically will be if your model is trained on next-token prediction (a task which incentivizes you to model all the correlations)).
Clusters of the second type might not have any interesting compositional structure, but your SAE gets bonus points for learning clusters which participate in compositional structure.
E.g. If there are five clusters with centroids w, x, y, z, and y + z and your SAE can only learn 2 of them, then it would prefer to learn the clusters with centroids y and z (because then it can model the cluster with centroid y + z for free).
In Towards Monosemanticity we also did a version of this experiment, and found that the SAE was much less interpretable when the transformer weights were randomized (https://transformer-circuits.pub/2023/monosemantic-features/index.html#appendix-automated-randomized).
(The results for correlations from auto-interp are less clear: they find similar correlation coefficients with and without weight randomization. However, they find that this might be due to single token features on the part of the randomized transformer and when you ignore these features (or correct in some other way I’m forgetting?), the SAE on an actual transformer indeed has higher correlation.)
Another metric is: comparing the similarity between two dictionaries using mean max cosine similarity (where one of the dictionaries is treated as the ground truth), we’ve found that two dictionaries trained from different random seeds on the same (non-randomized) model are highly similar (>.95), whereas dictionaries trained on a randomized model and an non-randomized model are dissimilar (<.3 IIRC, but I don’t have the data on hand).