We don’t take it for granted that SAEs are perfect or that they solve the problems we care about; we describe ourselves as cautiously optimistic about them. From an AI alignment perspective, there are many reasons to be excited, but the science is far from settled. If SAEs suck, then our work will hopefully help us work that out as soon as possible. We also hope to help build scientific consensus around SAEs via a number of methods, such as benchmarking against other techniques or red-teaming SAEs, which should either help us build something better or identify which other approaches deserve more attention.
The most obvious way SAEs could suck is that they might reveal structure in the data rather than structure in the network (finding vectors for combinations of features that correlate in practice, rather than logically/mechanistically independent features). Do you have any thoughts on how you would notice if that was the case?
Lots of thoughts. This is somewhat stream of consciousness as I happen to be short on time this week, but feel free to follow up again in the future:
Anthropic tested their SAEs on a model with random weights here and found that the results look noticeably different in some respects from SAEs trained on real models: “The resulting features are here, and contain many single-token features (such as “span”, “file”, ”.”, and “nature”) and some other features firing on seemingly arbitrary subsets of different broadly recognizable contexts (such as LaTeX or code).” I think further experiments like this would help: ones that identify classes of features which are highly non-trivial, which don’t occur in SAEs trained on random models (or random models with a W_E / W_U taken from a real model), or which can be related to interpretable circuitry.
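As a concrete illustration of one such control, here is a minimal, hypothetical sketch of re-initialising a model’s transformer blocks while keeping the real W_E / W_U, assuming a GPT-2-style model from Hugging Face transformers. The model choice and init scheme are placeholders, not the setup Anthropic actually used.

```python
# Hedged sketch, not Anthropic's actual setup: build a control model whose
# transformer blocks are re-initialised but whose token embedding (W_E) and
# tied unembedding (W_U) come from the trained model. SAE features that still
# appear on this control plausibly reflect data/embedding structure rather
# than learned circuitry.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")  # placeholder choice of model

with torch.no_grad():
    for block in model.transformer.h:
        for param in block.parameters():
            # Crude re-initialisation; a careful control would match the
            # original init scheme per parameter type.
            torch.nn.init.normal_(param, mean=0.0, std=0.02)
# model.transformer.wte (W_E) and the tied lm_head (W_U) are left untouched.

# Activations from this control model can then be used to train a second SAE,
# and its feature distribution (e.g. the fraction of single-token features)
# compared against an SAE trained on the unmodified model.
```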
I should note that, to the extent that SAEs could be capturing structure in the data, the model might want to capture that structure too, so it’s not super clear to me what you would observe that would distinguish SAEs capturing structure in the data which the model itself doesn’t utilise. Working this out seems important.
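One hedged idea for what you might observe (an assumption on my part, not an experiment we’ve run): if a direction captures structure the model itself doesn’t use, projecting it out of the residual stream should barely move the model’s loss, whereas directions the model relies on should hurt when ablated. A minimal sketch:

```python
# Hypothetical test: ablate a candidate SAE feature direction from the
# residual stream and check whether the model's loss changes. Little change
# suggests the direction reflects structure in the data that the model does
# not actually use downstream.
import torch

def ablate_direction(resid: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of residual-stream activations along `direction`.

    resid:     (..., d_model) activations at some layer
    direction: (d_model,) candidate feature/decoder direction
    """
    d = direction / direction.norm()
    return resid - (resid @ d).unsqueeze(-1) * d

# In practice this would be applied via a forward hook at a chosen layer,
# comparing next-token loss with and without the ablation over a large sample
# of tokens where the feature is active.
```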
Furthermore, the embedding space of LLMs is already highly structured, and since we lack good metrics, it’s hard to say how much “marginal” structure SAEs capture over existing methods. So quantifying what we mean by structure seems important too.
The specific claim that SAEs learn features which are combinations of true underlying features is a reasonable one given the L1 penalty, but I think it’s far from obvious how we should think about this in practice.
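For context, here is a minimal sketch of the generic SAE objective I have in mind (not any particular codebase; dimension names and coefficients are illustrative). The L1 term charges for each unit of activation, so a single latent covering two true features that almost always co-occur can be cheaper than representing them separately, which is one intuition for why composed features might get learned.

```python
# Generic SAE objective sketch: reconstruction loss plus an L1 penalty on
# feature activations (hyperparameters are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        acts = F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse feature activations
        recon = acts @ self.W_dec + self.b_dec                     # reconstruction of the input
        return recon, acts

def sae_loss(x, recon, acts, l1_coeff: float = 1e-3):
    mse = F.mse_loss(recon, x)             # reconstruction term
    sparsity = acts.abs().sum(-1).mean()   # L1 penalty: fewer/weaker activations are cheaper
    return mse + l1_coeff * sparsity
```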
I’m pretty excited about deliberate attempts to understand where SAEs might be misleading or might not be capturing information well (e.g. here or here). It seems like there are lots of slightly lower-level technical questions that help us build up to this.
So in summary: I’m a bit confused about what we mean here and think there are various technical threads to follow up on. Knowing which of them would actually resolve this requires that we try to define our terms more thoroughly.