> 2. Features are universal, meaning two models trained on the same data and achieving equal performance must learn identical features.
I would personally be very surprised if this is true in its strongest form. Empirically, different models can find more than one algorithm that achieves minimal loss [even on incredibly simple tasks like modular addition](https://arxiv.org/pdf/2306.17844.pdf).
As a side note, my understanding is that if you have two independent real features A and B which both occur with non-trivial frequency, and the SAE is sufficiently wide, the SAE may learn the features A&B, A&!B and !A&B rather than simply learning features A and B, because that yields better L0 loss (L0 should be P(A) + P(B) for the [A, B] feature set, versus P(A&B) + P(A&!B) + P(!A&B) = P(A) + P(B) - P(A&B) for the [A&B, A&!B, !A&B] feature set). The [A, B] representation still has better sparsity by the Elhage et al. definition, but I don’t think that necessarily means the [A, B] representation corresponds to minimal L0 loss. I’m not sure how much of an issue this is in practice, though. The post “Do sparse autoencoders find true features” has a lot more detail on this sort of thing.
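To make the L0 comparison concrete, here's a minimal numerical sanity check. The activation probabilities are made up purely for illustration; the general point only needs A and B to be independent and to co-occur with non-zero probability:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed activation probabilities for two independent features (made up for illustration).
p_a, p_b = 0.10, 0.07
n_samples = 1_000_000

a = rng.random(n_samples) < p_a
b = rng.random(n_samples) < p_b

# Dictionary 1: one latent per underlying feature, [A, B].
l0_ab = (a.sum() + b.sum()) / n_samples

# Dictionary 2: latents for the conjunctions [A&B, A&!B, !A&B].
l0_split = ((a & b).sum() + (a & ~b).sum() + (~a & b).sum()) / n_samples

print(f"mean L0 with [A, B]:            {l0_ab:.4f}")    # ~ P(A) + P(B) = 0.17
print(f"mean L0 with [A&B, A&!B, !A&B]: {l0_split:.4f}")  # ~ P(A) + P(B) - P(A&B) ≈ 0.163
```

The split dictionary fires at most one latent per sample instead of up to two, which is exactly where the P(A&B) saving in L0 comes from.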
This might all be academic if $d_{\text{ff}} \gg -n(1-S)\log(1-S)$ (i.e. the dimension of the feed-forward layer is big enough that you run out of meaningful features long before you run out of space to store them).
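For a rough sense of scale, you can just evaluate the right-hand side for whatever n and S you believe in. The numbers below are placeholders rather than estimates from the post, and I'm reading S as the per-feature sparsity (each feature active with probability 1 − S):

```python
import math

# Placeholder values, not estimates from the post: n candidate features,
# each assumed active with probability 1 - S.
n = 100_000
S = 0.999

threshold = -n * (1 - S) * math.log(1 - S)
d_ff = 4 * 4096  # a hypothetical feed-forward width

print(f"-n(1-S)log(1-S) ≈ {threshold:.0f}")  # ≈ 691 for these numbers
print(f"d_ff = {d_ff}")                      # 16384, comfortably larger here
```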
Thanks for the feedback, this is a great point! I haven’t come across evidence in real models that points towards this. My default assumption was that they are operating near the upper bound of possible superposition capacity. It would be great to know if they aren’t, since that affects how we estimate the number of features and, subsequently, the SAE expansion factor.
My impression from people working on SAEs is that the optimal number of features is very much an open question. In “Towards Monosemanticity” they observe that different numbers of features work fine; you just get feature splitting / collapse as you go bigger / smaller.
> The scaling laws are not mere empirical observations
This seems like a strong claim; are you aware of arguments or evidence for it? My impression (not at all strongly held) was that it’s seen as a useful rule of thumb that may or may not continue to hold.