Thanks for the comment! Just to check that I understand what you’re saying here:
We should not expect the SAE to learn anything about the original choice of basis at all. This choice of basis is not part of the SAE's training data. If we want to be sure of this, we can plot the SAE's training data on the plane as a scatter plot and see that it is independent of any choice of basis.
Basically, you’re saying that in the model’s hidden plane, data points are just scattered throughout the area of the unit circle (in the uncorrelated case), and in the case of one set of features they’re scattered within one quadrant of the unit circle, right? Those points are what gets fed into the SAE as input, so from that perspective it perhaps makes sense that the uncorrelated case learns the 45° vectors, since that’s the mean direction of all of the SAE’s input training data. Neat, I hadn’t thought about it in those terms.
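Here’s a quick numerical sanity check of that “mean of the training data” picture. This is my own throwaway sketch (not code from the post), and it assumes the hidden activations are non-negative and roughly uniform over a quarter of the unit disc:

```python
# Sketch: sample points uniformly over one quadrant of the unit disc as a stand-in
# for the SAE's hidden-plane training data, then check the direction of their mean.
import numpy as np

rng = np.random.default_rng(0)

# Rejection-sample the quarter disc {x, y >= 0, x^2 + y^2 <= 1} from the unit square.
pts = rng.uniform(0.0, 1.0, size=(200_000, 2))
pts = pts[np.linalg.norm(pts, axis=1) <= 1.0]

mean_vec = pts.mean(axis=0)
angle_deg = np.degrees(np.arctan2(mean_vec[1], mean_vec[0]))
print(angle_deg)  # ~45.0, i.e. the mean points along the diagonal
```

By symmetry the mean has to sit on the diagonal, so the 45° direction really is the “average” input the SAE sees in that quadrant.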
This, to me, seems like a success of the SAE.
I can understand this lens! I guess I’m considering this a failure mode because I’m assuming that what we want SAEs to do is reconstruct the known underlying features, since we (the interp community) are trying to use them to find the “true” underlying features in, e.g., natural language. I’ll have to think on this a bit more. To your point: maybe they simply can’t learn the original basis choice, and I think that would be bad?
Hi Evan, thank you for the explanation, and sorry for the late reply.
I think that the inability to learn the original basis is tied to the properties of the SAE’s training dataset (and won’t be solved by adding extra terms to the SAE’s loss function). I think it’s because we could have generated the same dataset with a different choice of basis (though I haven’t tried formalizing that argument, nor have I run any experiments).
I also want to say that perhaps not being able to learn the original basis is not so bad after all. As long as we can represent the full number of orthogonal feature directions (4 in your example), we are okay. (Though this is a point I need to think more about in the case of large language models.)
If I understood Demian Till’s post right, his examples involved some of the features not being learned at all. In your example, it would be equivalent to saying that an SAE could learn only 3 feature directions and not the 4th. But your SAE could learn all four directions.
Hi Ali, sorry for my slow response, too! Needed to think on it for a bit.
Yep, you could definitely generate the dataset with a different basis (e.g., [1,0,0,0] = 0.5*[1,0,1,0] + 0.5*[1,0,-1,0]).
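Just spelling the arithmetic out (a throwaway check, nothing from the post itself):

```python
# Verify that the original one-hot feature direction is an equal mix of the two
# "rotated" directions, and that their difference gives back another one-hot direction.
import numpy as np

e1 = np.array([1, 0, 0, 0])
v_plus = np.array([1, 0, 1, 0])
v_minus = np.array([1, 0, -1, 0])

assert np.array_equal(0.5 * v_plus + 0.5 * v_minus, e1)
assert np.array_equal(0.5 * v_plus - 0.5 * v_minus, np.array([0, 0, 1, 0]))
print("decomposition checks out")
```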
I think in the context of language models, learning a different basis is a problem. I assume that things there aren’t so clean as “you can get back the original features by adding 1/2 of that and 1/2 of this”. I’d imagine it’s more like feature 1 = “‘the’ in context A”, feature 2 = “‘the’ in context B”, feature 3 = “‘the’ in context C”. And if “the” is a real feature (I’m not sure it is), then I don’t know how to back out the real basis from those three features. But I think this just points to needing more work here, especially experiments with more (and more complex) features!
Yes, good point, I think that Demian’s post was worried about some features not being learned at all, while here all features were learned—even if they were rotated—so that is promising!
Regarding some features not being learnt at all, I was anticipating this might happen when some features activate much more rarely than others, potentially incentivising SAEs to learn more common combinations of features instead of some of the rarer features. To actually see this, we’d need to experiment with more variations, as mentioned in my other comment.
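To make that concrete, this is the kind of toy generator I have in mind (a hypothetical sketch with made-up firing probabilities, not something from either post):

```python
# Hypothetical setup: four features that fire independently, but the last one fires
# far more rarely than the others. Training an SAE on data like this is where I'd
# expect common feature combinations to crowd out the rare feature.
import numpy as np

rng = np.random.default_rng(0)

n_samples = 100_000
fire_probs = np.array([0.5, 0.5, 0.5, 0.01])  # assumed values; the last feature is rare

fires = rng.uniform(size=(n_samples, 4)) < fire_probs    # which features are active
magnitudes = rng.uniform(0.0, 1.0, size=(n_samples, 4))  # activation strengths when active
features = fires * magnitudes

print(features.mean(axis=0))  # the rare feature's average activation is ~50x smaller
```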