Hey guys, great post and great work!

I have a comment, though. For concreteness, let me focus on the case of the (x_2, y_1) composition of features. This corresponds to feature vectors of the form A[0, 1, 1, 0] in the case of correlated feature amplitudes and [0, a, b, 0] in the case of uncorrelated feature amplitudes. Note that the plane spanned by x_2 and y_1 admits an infinite family of orthogonal bases; one of them, for example, is [0, 1, 1, 0] and [0, 1, −1, 0]. When we train a Toy Model of Superposition, we plot the projection of our chosen feature basis, as done by Anthropic and also by you guys. However, the training dataset for the SAE (that you trained afterward) contains no information about the original (arbitrarily chosen by us) basis. SAEs could learn to decompose vectors from the dataset in terms of *any* of the infinite family of bases.
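Here’s a minimal numpy sketch of what I mean (the amplitude value and variable names below are just illustrative): either orthonormal basis of that plane reconstructs the same data vector exactly, so nothing in the data singles out the original axes.

```python
import numpy as np

A = 0.7                                    # arbitrary amplitude (illustrative)
v = A * np.array([0.0, 1.0, 1.0, 0.0])     # a "composed" data point in the (x_2, y_1) plane

# the original feature basis for that plane ...
basis_orig = np.array([[0, 1, 0, 0],
                       [0, 0, 1, 0]], dtype=float)
# ... and one member of the infinite family of rotated orthonormal bases
basis_rot = np.array([[0, 1,  1, 0],
                      [0, 1, -1, 0]], dtype=float) / np.sqrt(2)

for B in (basis_orig, basis_rot):
    coeffs = B @ v                          # coordinates of v in this basis
    recon = coeffs @ B                      # rebuild v from those coordinates
    print(coeffs, np.allclose(recon, v))    # exact reconstruction either way
```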
This is exactly what some of your SAEs seem to be doing. They are still learning four antipodal directions (which are just not the same as the four antipodal directions corresponding to your original chosen basis). This, to me, seems like a success of the SAE.
We should not expect the SAE to learn anything about the original choice of basis at all. This choice of basis is not part of the SAE training data. If we want to be sure of this, we can plot the training data of the SAE on the plane (in terms of a scatter plot) and see that it is independent of any choice of bases.
It’s actually worse than what you say: the first two datasets studied here have a privileged basis 45 degrees off from the standard one, which is why the SAEs keep learning the same 45-degrees-off features. Unpacking this a bit: it turns out that both datasets have principal components 45 degrees off from the basis the authors present as natural, and since SAEs are, in a sense, trying to capture the principal directions of variation in the activation space, they will also naturally use features 45 degrees off from the “natural” basis.
Consider the first example. By construction, since x_1 and x_2 are perfectly anticorrelated, as are y_1 and y_2, the data is two-dimensional and can be represented as x = x_1 - x_2 and y = y_1 - y_2. Indeed, this is exactly what their diagram is assuming. But here, x and y have the same absolute magnitude by construction, so the dataset lies entirely on the diagonals of the unit square, and the principal directions of variation are obviously along the diagonals.
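A quick sketch of that geometry (guessing at the sampling details: one of x_1/x_2 active, one of y_1/y_2 active, a single shared amplitude per sample), just to confirm every point sits on one of the two diagonals:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
A = rng.uniform(0, 1, size=n)      # shared (correlated) amplitude per sample (assumed distribution)
sx = rng.choice([1, -1], size=n)   # +1 if x_1 is the active feature, -1 if x_2 is
sy = rng.choice([1, -1], size=n)   # +1 if y_1 is the active feature, -1 if y_2 is

x = A * sx                         # x = x_1 - x_2
y = A * sy                         # y = y_1 - y_2

print(np.allclose(np.abs(x), np.abs(y)))   # True: every point lies on a diagonal y = +/- x
```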
Now, why does the SAE want to learn the principal components? Because doing so allows the SAE to have smaller activations on average for a given weight norm.
Consider the axis-aligned representation, in which the SAE neurons correspond to x_1, x_2, y_1, y_2. Since there’s weight decay, the encoding and decoding weights want to be of the same magnitude; let’s suppose both are of size s. Now, if the features are axis aligned, a composed data point of amplitude A activates two neurons, so the total size of the activations will be 2A/s. But if you instead use neurons aligned with x_1 + y_1, x_1 + y_2, x_2 + y_1, x_2 + y_2, only one neuron fires, and the activations only need to be of total size \sqrt 2 A/s. This means that a non-axis-aligned representation will have lower loss. Indeed, something like this story is why we expect the L1 penalty to recover “true features” in the first place.
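To make the counting concrete, here is a small numerical sketch (the values of A and s and the dictionary names are mine, purely illustrative): reconstructing the same composed data point costs total activation 2A/s with the axis-aligned dictionary but only \sqrt 2 A/s with the rotated one.

```python
import numpy as np

A, s = 1.0, 0.5                              # amplitude and weight scale (illustrative values)
e = np.eye(4)
v = A * (e[1] + e[2])                        # the composed data point x_2 + y_1, amplitude A

# decoder columns of norm s: axis-aligned (x_1, x_2, y_1, y_2) ...
D_axis = s * e
# ... versus aligned with the four diagonal combinations (x_i + y_j) / sqrt(2)
D_rot = s * np.stack([e[0] + e[2], e[0] + e[3],
                      e[1] + e[2], e[1] + e[3]], axis=1) / np.sqrt(2)

# sparse codes that reconstruct v exactly under each dictionary
acts_axis = np.array([0.0, A / s, A / s, 0.0])            # two neurons fire at A/s each
acts_rot = np.array([0.0, 0.0, np.sqrt(2) * A / s, 0.0])  # one neuron fires at sqrt(2)*A/s

assert np.allclose(D_axis @ acts_axis, v) and np.allclose(D_rot @ acts_rot, v)
print(np.abs(acts_axis).sum(), np.abs(acts_rot).sum())    # 2A/s = 4.0 vs sqrt(2)*A/s ~= 2.83
```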
The story for the second dataset is pretty similar to the first: when the data is uniformly distributed over a unit square, the principal directions (of the raw, un-centered activations the SAE sees) are the diagonals of the square, not the standard basis.
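To see this concretely (keeping in mind that the SAE is fed raw, un-centered activations), here is a quick check that the top eigenvector of the second-moment matrix of points drawn uniformly from the unit square is the main diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, size=(100_000, 2))   # data uniform over the unit square

second_moment = pts.T @ pts / len(pts)       # un-centered, since the SAE sees raw activations
eigvals, eigvecs = np.linalg.eigh(second_moment)

print(eigvecs[:, -1])   # top direction ~ [0.71, 0.71] (up to sign): the main diagonal
print(eigvecs[:, 0])    # other direction ~ [0.71, -0.71]: the anti-diagonal
```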
Hi Lawrence! Thanks so much for this comment and for spelling out (with the math) where and how our thinking and dataset construction were poorly set up. I agree with your analysis and critiques of the first dataset. The biggest problem with that dataset, in my eyes (as you point out), is that the actual features in the data are not the ones that I wanted them to be (and claimed them to be), so the SAE isn’t really learning “composed features.”
In retrospect, I wish I had just skipped ahead to the second dataset, which had a result that was (to me) surprising at the time of the post. But there I hadn’t thought about looking at the PCs in the hidden space, and didn’t realize those were the diagonals. This makes a lot of sense, and now I understand much better why the SAE recovers those.
My big takeaway from this whole post is: I need to think on this all a lot more! I’ve struggled a lot to construct a dataset that successfully has some of the interesting characteristics of language model data and also has interesting compositions / correlations. After a month of playing around and reflection, I don’t think the “two sets of one-hot features” thing we did here is the best way to study this kind of phenomenon.
Thanks for the comment! Just to check that I understand what you’re saying here:
We should not expect the SAE to learn anything about the original choice of basis at all. This choice of basis is not part of the SAE training data. If we want to be sure of this, we can plot the training data of the SAE on the plane (in terms of a scatter plot) and see that it is independent of any choice of bases.
Basically, you’re saying that in the hidden plane of the model, data points are just scattered throughout the area of the unit circle (in the uncorrelated case), and in the case of one set of features they’re just scattered within one quadrant of the unit circle, right? And those are the things being fed into the SAE as input, so from that perspective perhaps it makes sense that the uncorrelated case learns the 45° vectors, because that’s the direction of the mean of all of the SAE’s input training data. Neat, hadn’t thought about it in those terms.
This, to me, seems like a success of the SAE.
I can understand this lens! I guess I’m considering this a failure mode because I’m assuming that what we want SAEs to do is reconstruct the known underlying features, since we (the interp community) are trying to use them to find the “true” underlying features in, e.g., natural language. I’ll have to think on this a bit more. To your point: maybe they can’t learn about the original basis choice, and I think that would maybe be bad?
Hi Evan, thank you for the explanation, and sorry for the late reply.
I think that the inability to learn the original basis is tied to the properties of the SAE training dataset (and won’t be solved by supplementing SAEs with additional terms in their loss function). I think this is because we could have generated the same dataset with a different choice of basis (though I haven’t tried formalizing the argument or run any experiments).
I also want to say that perhaps not being able to learn the original basis is not so bad after all. As long as we can represent the full number of orthogonal feature directions (4 in your example), we are okay. (Though this is a point I need to think more about in the case of large language models.)
If I understood Demian Till’s post right, his examples involved some of the features not being learned at all. In your example, it would be equivalent to saying that an SAE could learn only 3 feature directions and not the 4th. But your SAE could learn all four directions.
Hi Ali, sorry for my slow response, too! Needed to think on it for a bit.
Yep, you could definitely generate the dataset with a different basis (e.g., [1,0,0,0] = 0.5*[1,0,1,0] + 0.5*[1,0,-1,0]).
I think that in the context of language models, learning a different basis is a problem. I assume that, there, things aren’t so clean as “you can get back the original features by adding 1/2 of that and 1/2 of this”. I’d imagine it’s more like feature 1 = “*the* in context A”, feature 2 = “*the* in context B”, feature 3 = “*the* in context C”. And if *the* is a real feature (I’m not sure it is), then I don’t know how to back out the real basis from those three features. But I think this points to just needing to carry out more work on this, especially in experiments with more (and more complex) features!
Yes, good point, I think that Demian’s post was worried about some features not being learned at all, while here all features were learned—even if they were rotated—so that is promising!
Regarding some features not being learnt at all, I was anticipating that this might happen when some features activate much more rarely than others, potentially incentivising SAEs to learn more common combinations instead of some of the rarer features. In order to see this, we’d need to experiment with more variations, as mentioned in my other comment.