Lots of thoughts. This is somewhat stream of consciousness as I happen to be short on time this week, but feel free to follow up again in the future:
Anthropic tested their SAEs on a model with random weights here and found that the results look noticeably different in some respects from SAEs trained on real models: “The resulting features are here, and contain many single-token features (such as “span”, “file”, “.”, and “nature”) and some other features firing on seemingly arbitrary subsets of different broadly recognizable contexts (such as LaTeX or code).” I think further experiments like this would help: ones which identify classes of features that are highly non-trivial, that don’t occur in SAEs trained on random models (or random models with a W_E / W_U from a real model), or that can be related to interpretable circuitry.
I should note that, to the extent that SAEs could be capturing structure in the data, the model might want to capture structure in the data too, so it’s not super clear to me what observation would show that an SAE is capturing structure in the data which the model itself doesn’t utilise. Working this out seems important.
Furthermore, the embedding space of LLMs is highly structured already and, since we lack good metrics, it’s hard to say precisely how much “marginal” structure SAEs capture beyond what existing methods already capture. So quantifying what we mean by structure seems important too.
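To make the control concrete, here is a rough sketch of how one might build the “random model with a real W_E / W_U” baseline. This is my own illustration, not Anthropic’s procedure: the `randomize_weights` helper, the dict-of-arrays parameter format, and the choice to resample each matrix from a Gaussian with matched mean and standard deviation are all assumptions.

```python
import numpy as np

def randomize_weights(params, keep=("W_E", "W_U"), seed=0):
    """Return a copy of `params` with every matrix resampled at random,
    except the names listed in `keep` (hypothetical parameter names)."""
    rng = np.random.default_rng(seed)
    out = {}
    for name, w in params.items():
        if name in keep:
            # Keep the real embedding / unembedding untouched.
            out[name] = w.copy()
        else:
            # Assumption: a Gaussian with the same mean and std as the original.
            out[name] = rng.normal(w.mean(), w.std(), size=w.shape)
    return out

# Toy stand-in for a real model's parameters (shapes are arbitrary).
toy_rng = np.random.default_rng(1)
toy_params = {
    "W_E": toy_rng.normal(size=(10, 4)),
    "W_mlp": toy_rng.normal(size=(4, 8)),
    "W_U": toy_rng.normal(size=(4, 10)),
}
control = randomize_weights(toy_params)
```

You would then train the same SAE on activations from both `toy_params` and `control` and ask which classes of features only show up for the real model.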
The specific claim that SAEs learn features which are combinations of true underlying features is a reasonable one given the L1 penalty, but I think it’s far from obvious how we should think about this in practice.
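As a toy illustration of why the L1 penalty makes this plausible (a sketch under my own assumptions: two orthogonal “true” features that are active together, unit-norm decoder rows, and a hypothetical `sae_loss` helper), both dictionaries below reconstruct the input exactly, but the one with a single combined feature pays a smaller L1 cost:

```python
import numpy as np

def sae_loss(x, codes, decoder, l1_coeff=1.0):
    """Squared reconstruction error plus an L1 penalty on the codes.
    Returns (total, reconstruction_error, l1)."""
    recon = codes @ decoder
    mse = float(np.sum((x - recon) ** 2))
    l1 = float(np.sum(np.abs(codes)))
    return mse + l1_coeff * l1, mse, l1

# Two "true" features with orthogonal unit directions, both active at once.
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
x = e1 + e2

# Dictionary A: one latent per true feature.
separate_decoder = np.stack([e1, e2])
separate_codes = np.array([1.0, 1.0])

# Dictionary B: a single unit-norm latent for the combination.
combined_decoder = ((e1 + e2) / np.sqrt(2.0))[None, :]
combined_codes = np.array([np.sqrt(2.0)])

separate = sae_loss(x, separate_codes, separate_decoder)
combined = sae_loss(x, combined_codes, combined_decoder)
```

Here the L1 term is 2 for the separate dictionary and √2 for the combined one. This only covers the case where the features always co-occur; how the trade-off plays out when they sometimes fire alone is exactly the part I think is non-obvious.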
I’m pretty excited about deliberate attempts to understand where SAEs might be misleading or not capturing information well (e.g. here or here). It seems like there are lots of slightly more low-level technical questions that help us build up to this.
So in summary: I’m a bit confused about what we mean here and think there are various technical threads to follow up on. Knowing which of them would actually resolve this requires that we define our terms more thoroughly.