Hi Ali, sorry for my slow response, too! Needed to think on it for a bit.
Yep, you could definitely generate the dataset with a different basis (e.g., [1,0,0,0] = 0.5*[1,0,1,0] + 0.5*[1,0,-1,0]).
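(For concreteness, here’s a quick numpy check of that identity; the variable names are just mine, not anything from the experiments:)

```python
import numpy as np

# Original basis vector and the two "rotated" features from the example above
e0 = np.array([1.0, 0.0, 0.0, 0.0])
f1 = np.array([1.0, 0.0, 1.0, 0.0])
f2 = np.array([1.0, 0.0, -1.0, 0.0])

# The original feature is an equal mix of the two rotated ones
assert np.allclose(0.5 * f1 + 0.5 * f2, e0)
```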
I think that in the context of language models, learning a different basis is a problem. I assume that, there, things aren’t so clean as “you can get back the original features by adding 1/2 of that and 1/2 of this”. I’d imagine it’s more like feature 1 = “‘the’ in context A”, feature 2 = “‘the’ in context B”, feature 3 = “‘the’ in context C”. And if “the” is a real feature (I’m not sure it is), then I don’t know how to back out the real basis from those three features. But I think this points to just needing more work here, especially experiments with more (and more complex) features!
Yes, good point, I think that Demian’s post was worried about some features not being learned at all, while here all features were learned—even if they were rotated—so that is promising!
Regarding some features not being learned at all: I was anticipating this might happen when some features activate much more rarely than others, potentially incentivising SAEs to learn more common combinations instead of some of the rarer features. To actually see this, we’d need to experiment with more variations, as mentioned in my other comment.