I’m glad you liked it.
I definitely agree that the LRH and the interpretability of the linear features are separate hypotheses; that was what I was trying to get at by having monosemanticity as a separate assumption to the LRH. I think these are logically independent: there could be some explicit representation in which everything corresponds to an interpretable feature, but in a format more complicated than linear (i.e. monosemanticity is true but the LRH is false); or, as you say, the network could in some sense be mostly manipulating features, but those features could be very hard to understand (LRH true, monosemanticity false); or both could just be the wrong frame. I definitely think it would be good if we spent a bit more effort clarifying these distinctions; I hope this essay made some progress in that direction, but I don’t think it’s the last word on the subject.
I agree that coming up with experiments which would test the LRH in isolation is difficult. But maybe this should be more of a research priority; we ought to be able to formulate a version of the strong LRH which makes strong empirical predictions. I think something along the lines of https://arxiv.org/abs/2403.19647 is maybe going in the right direction here. In a shameless self-plug, I hope that LMI’s recent work on open-sourcing a massive SAE suite (Gemma Scope) will let people test out this sort of thing.
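To make concrete the kind of structure such a test would be probing, here is a minimal sketch of the decomposition a JumpReLU SAE (the architecture used for Gemma Scope) implements: an activation gets reconstructed as a sparse sum of fixed feature directions, which is exactly the structure the strong LRH predicts should be there to find. The function and parameter names below are my own illustration, not the released API or weights.

```python
# Sketch: an activation x is approximated as b_dec + sum_i f_i(x) * d_i, where only a
# few f_i are nonzero. Names (W_enc, b_enc, W_dec, b_dec, threshold) follow the usual
# JumpReLU SAE convention; treat shapes and names as assumptions for illustration.
import torch

def jumprelu_sae_decompose(x, W_enc, b_enc, W_dec, b_dec, threshold):
    """Return sparse feature activations f and the reconstruction x_hat.

    x:     (d_model,) residual-stream activation
    W_enc: (d_model, n_features); W_dec: (n_features, d_model)
    """
    pre = x @ W_enc + b_enc                                   # one pre-activation per feature
    f = torch.where(pre > threshold, pre, torch.zeros_like(pre))  # JumpReLU: gate sub-threshold features to zero
    x_hat = f @ W_dec + b_dec                                 # sparse sum of decoder directions
    return f, x_hat

# Toy usage with random weights (real weights would come from a trained SAE).
d_model, n_features = 16, 64
W_enc = torch.randn(d_model, n_features) / d_model ** 0.5
W_dec = torch.randn(n_features, d_model) / n_features ** 0.5
b_enc, b_dec = torch.zeros(n_features), torch.zeros(d_model)
threshold = torch.full((n_features,), 0.5)

x = torch.randn(d_model)
f, x_hat = jumprelu_sae_decompose(x, W_enc, b_enc, W_dec, b_dec, threshold)
print((f != 0).sum().item(), "active features; reconstruction error:", (x - x_hat).norm().item())
```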
Having said that, one reason I’m a bit pessimistic is that stronger versions of the LRH do seem to predict that there is some set of ‘ground truth’ features that a wide enough or well-tuned enough SAE ought to converge to (perhaps there should be some ‘phase change’ in the scaling curves as you sweep the hyperparameters), but AFAIK we have been unable to find any evidence for this even in toy models.
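For what it’s worth, the kind of toy-model check I have in mind is easy to sketch: fix some ground-truth sparse features, superpose them into a lower-dimensional space, train an SAE on the resulting activations, and ask whether the learned decoder directions converge to the ground truth (e.g. via mean max cosine similarity) as you widen or tune the SAE. A minimal PyTorch sketch follows; all sizes and hyperparameters are illustrative assumptions, not taken from any particular paper.

```python
# Toy-model test: do a trained SAE's decoder directions recover known ground-truth features?
import torch

torch.manual_seed(0)
n_true, d_model, n_sae = 64, 16, 128           # ground-truth features, model dim, SAE width
p_active = 0.03                                # sparsity of ground-truth features

# Ground-truth feature directions (unit norm), in superposition because n_true > d_model.
D_true = torch.nn.functional.normalize(torch.randn(n_true, d_model), dim=-1)

def sample_batch(batch_size):
    # Each feature fires independently with prob p_active, with a random positive magnitude.
    f = (torch.rand(batch_size, n_true) < p_active).float() * torch.rand(batch_size, n_true)
    return f @ D_true                          # activations are sparse linear combinations

# A vanilla ReLU SAE trained with reconstruction + L1 sparsity loss.
W_enc = torch.nn.Parameter(torch.randn(d_model, n_sae) * 0.1)
b_enc = torch.nn.Parameter(torch.zeros(n_sae))
W_dec = torch.nn.Parameter(torch.randn(n_sae, d_model) * 0.1)
b_dec = torch.nn.Parameter(torch.zeros(d_model))
opt = torch.optim.Adam([W_enc, b_enc, W_dec, b_dec], lr=1e-3)

for step in range(5000):
    x = sample_batch(512)
    f = torch.relu((x - b_dec) @ W_enc + b_enc)
    x_hat = f @ W_dec + b_dec
    loss = ((x - x_hat) ** 2).mean() + 3e-4 * f.abs().sum(-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Mean max cosine similarity: for each ground-truth direction, how well does the closest
# learned decoder direction match it? A value near 1.0 (stable across seeds and widths)
# would be the 'convergence to ground truth' that stronger versions of the LRH suggest.
D_learned = torch.nn.functional.normalize(W_dec.detach(), dim=-1)
mmcs = (D_true @ D_learned.T).max(dim=-1).values.mean()
print(f"mean max cosine similarity to ground truth: {mmcs:.3f}")
```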
I don’t want to overstate this point though; I think part of the reason for the excitement around SAEs is that this was genuinely great science: the Toy Models paper proposed some theoretical reasons to expect linear representations in superposition, which implied that something like SAEs should recover interesting representations, and then SAEs were in fact quite successful! (This is why I say in the post that I think there’s a reasonable amount of evidence for at least the weak LRH.)