If the SAEs are not full-distribution competitive, I don’t really trust that the features they’re seeing are actually the variables being computed on, in the sense of reflecting the true mechanistic structure of the learned network algorithm, or that the explanations they offer are correct[1]. If I pick a small enough sub-distribution, I can pretty much always get perfect reconstruction no matter what kind of probe I use, because e.g. measured over a single token the network layers will have representation rank 1, and the entire network can be written as a rank-1 linear transform. So I can declare the activation vector at layer l to be the active “feature”, use the single-entry linear maps between SAEs to “explain” how features map from one layer to the next, and be done. Those explanations will of course be nonsense and will not extrapolate out of distribution at all. I can’t use them to build a causal model that accurately reproduces the network’s behaviour, or some aspect of it, when dealing with a new prompt.
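To make the degenerate case concrete, here is a minimal numpy sketch (my own toy illustration, not taken from any SAE paper): over a “distribution” consisting of a single activation vector, a one-feature autoencoder reconstructs perfectly while explaining nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# Stand-in for the layer-l residual stream on one fixed token/prompt.
act = rng.normal(size=d_model)

# One-feature "SAE": the decoder direction is just the normalised activation,
# and the encoder reads off the coefficient along that direction.
dec_dir = act / np.linalg.norm(act)   # (d_model,)
feature = dec_dir @ act               # scalar "feature activation"
recon = feature * dec_dir

print(np.allclose(recon, act))        # True: zero reconstruction error on this
                                      # sub-distribution, by construction
```

The “map between layers” in this picture is then just a single number relating two such scalars, which is exactly the kind of vacuous explanation I mean.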
We don’t train SAEs on literally single tokens, but I would be worried about the qualitative problem persisting. The network itself doesn’t have a million different algorithms to perform a million different narrow subtasks. It has a finite description length. It’s got to be using a smaller set of general algorithms that handle all of these different subtasks, at least to some extent. Likely more so for more powerful and general networks. If our “explanations” of the network then model it in terms of different sets of features and circuits for different narrow subtasks that don’t fit together coherently to give a single good reconstruction loss over the whole distribution, that seems like a sign that our SAE layer activations didn’t actually capture the general algorithms in the network. Thus, predictions about network behaviour made on the basis of inspecting causal relationships between these SAE activations might not be at all reliable, especially predictions about behaviours like instrumental deception which might be very mechanistically related to how the network does well on cross-domain generalisation.
[1] As in, that seems like a minimum requirement for the SAEs to fulfil. Not that this would be enough to make me trust predictions about generalisation based on stories about SAE activations.
(This reply is less important than my other)

> The network itself doesn’t have a million different algorithms to perform a million different narrow subtasks
For what it’s worth, this sort of thinking is really not obvious to me at all. It seems very plausible that frontier models only have their amazing capabilities through the aggregation of a huge number of dumb heuristics (as an aside, I think if true this is net positive for alignment). This is consistent with findings that e.g. grokking and phase changes are much less common in LLMs than in toy models.
(Two objections to these claims are that current frontier models are plausibly limited in important ways, and that it’s really hard to prove either of us correct on this point since it’s all hand-wavy)
Thanks for the first sentence—I appreciate clearly stating a position.
> measured over a single token the network layers will have representation rank 1
I don’t follow this. Are you saying that the residual stream at position 0 in a transformer is a function of the first token only, or something like this?
If so, I agree—but I don’t see how this applies to much SAE[1] or mech interp[2] work. Where do we disagree?
[1] E.g. in this post here we show in detail how an “inside a question beginning with ‘which’” SAE feature is computed from the “which” token and predicts question marks (I helped with this project but didn’t personally find this feature)
[2] More generally, in narrow-distribution mech interp work such as the IOI paper, I don’t think it makes sense to reduce the explanation to single-token perfect-accuracy probes, since our explanation generalises fairly well (e.g. the “Adversarial examples” Alexandre found in Section 4.4)
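As an aside on the position-0 question above, here is a minimal causal-attention sketch in numpy (my own toy with random weights, not any real model): under a causal mask, the output at position 0 can only attend to position 0, so it is a function of the first token alone.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def causal_attention(x):
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf                        # causal mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

tok = rng.normal(size=(5, d))        # five made-up "token embeddings"
seq_a = tok[[0, 1, 2]]               # same first token...
seq_b = tok[[0, 3, 4]]               # ...different continuations

out_a, out_b = causal_attention(seq_a), causal_attention(seq_b)
print(np.allclose(out_a[0], out_b[0]))   # True: position 0 ignores later tokens
```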