Hi Evan, thank you for the explanation, and sorry for the late reply.
I think that the inability to learn the original basis is tied to the properties of the SAE training dataset (and won’t be solved by supplementing SAEs with additional terms in their loss function). I think it’s because we could have generated the same dataset with a different choice of basis (though I haven’t tried formalizing the argument or run any experiments).
I also want to say that perhaps not being able to learn the original basis is not so bad after all. As long as we can represent the full number of orthogonal feature directions (4 in your example), we are okay. (Though this is a point I need to think more about in the case of large language models.)
If I understood Demian Till’s post right, his examples involved some of the features not being learned at all. In your example, it would be equivalent to saying that an SAE could learn only 3 feature directions and not the 4th. But your SAE could learn all four directions.
Hi Ali, sorry for my slow response, too! Needed to think on it for a bit.
Yep, you could definitely generate the dataset with a different basis (e.g., [1,0,0,0] = 0.5*[1,0,1,0] + 0.5*[1,0,-1,0]).
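To make that identity concrete, here is a minimal NumPy sketch. It assumes the original features are the four standard unit vectors in R^4; `alt_basis` and `recode` are hypothetical names introduced only for illustration, and the alternative basis is just one choice that happens to contain [1,0,1,0] and [1,0,-1,0].

```python
import numpy as np

# Assumption: the original feature basis is the standard basis of R^4.
original_basis = np.eye(4)

# Hypothetical alternative basis (one row per direction). The first and
# third rows are the sum and difference directions from the identity above.
alt_basis = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, -1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

# The identity from the comment: [1,0,0,0] = 0.5*[1,0,1,0] + 0.5*[1,0,-1,0]
e1 = 0.5 * alt_basis[0] + 0.5 * alt_basis[2]


def recode(coeffs):
    """Return coefficients over alt_basis that reproduce the same data
    as `coeffs` over original_basis (rows of `coeffs` are samples)."""
    data = coeffs @ original_basis
    # Solve alt_coeffs @ alt_basis == data for alt_coeffs.
    return np.linalg.solve(alt_basis.T, data.T).T


# Usage: any dataset built from the original basis can be rebuilt exactly
# from the alternative one.
rng = np.random.default_rng(0)
coeffs = rng.random((5, 4))
same_data = np.allclose(recode(coeffs) @ alt_basis, coeffs @ original_basis)
```

The recoded coefficients reproduce the data exactly, but they need not be equally sparse: a sample that activates only the first original feature needs two of the alternative directions.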
I think in the context of language models, learning a different basis is a problem. I assume that, there, things aren’t so clean as “you can get back the original features by adding 1⁄2 of that and 1⁄2 of this”. I’d imagine it’s more like feature 1 = “‘the’ in context A”, feature 2 = “‘the’ in context B”, feature 3 = “‘the’ in context C”. And if “the” is a real feature (I’m not sure it is), then I don’t know how to back out the real basis from those three features. But I think this points to just needing to carry out more work on this, especially in experiments with more (and more complex) features!
Yes, good point. I think that Demian’s post was worried about some features not being learned at all, while here all features were learned (even if they were rotated), so that is promising!
Regarding some features not being learnt at all, I was anticipating this might happen when some features activate much more rarely than others, potentially incentivising SAEs to learn more common combinations instead of some of the rarer features. To see this, we’d need to experiment with more variations, as mentioned in my other comment.