I haven’t yet paid enough attention to dictionary learning myself to give a confident answer. But here’s an incomplete answer, and some of the things I’d think about if I were to investigate the topic more:
Insofar as dictionary learning is building on e.g. the Olah team’s Toy Models of Superposition work, at least some of the work of “discovering the ontology” has been done. So it’s not completely implausible that this method is basically right! I’d still be skeptical both about how well the Toy Models work generalizes, and about how well dictionary learning captures the phenomenon Toy Models found (e.g. “seems kinda intuitively related” doesn’t really cut it).
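For concreteness, here’s a minimal sketch of the sort of sparse autoencoder people typically use for this flavor of dictionary learning on activations. This is my own illustrative version, not the setup from any particular paper; the dimensions and the L1 coefficient are arbitrary placeholders.

```python
# Minimal sketch of a sparse autoencoder for dictionary learning on
# model activations. Dimensions and the L1 coefficient are illustrative
# placeholders, not values from the Toy Models work or any follow-up.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # activations -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)   # feature coefficients -> reconstruction

    def forward(self, acts: torch.Tensor):
        codes = torch.relu(self.encoder(acts))      # nonnegative, hopefully sparse codes
        recon = self.decoder(codes)
        return recon, codes

def loss_fn(recon, acts, codes, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty pushing the codes toward sparsity.
    return ((recon - acts) ** 2).mean() + l1_coeff * codes.abs().mean()

# Usage: d_model = residual stream width, d_dict = a (much) larger dictionary.
sae = SparseAutoencoder(d_model=512, d_dict=4096)
acts = torch.randn(64, 512)                         # stand-in batch of activations
recon, codes = sae(acts)
loss = loss_fn(recon, acts, codes)
loss.backward()
```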
Implementation details matter. If there are degrees of freedom where people made arbitrary-looking design choices, that’s a very bad sign. (Note: link is to a post about ad-hoc mathematical definitions, but the same general considerations apply here.)
There’s still the “interpret the features as what?” side of the problem. E.g. insofar as things are based on the Toy Models phenomenon, the “as what” should be sparse features in the data/environment (where “sparse feature” is interpreted in the same specific sense as in the Toy Models work, not just some vague intuition about sparsity and features). And then we need to think pretty carefully about which human-intuitive stuff does and does not constitute a “sparse feature” in the data in that specific sense.
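To pin down what “sparse feature” means in that specific sense, here’s a rough sketch of the kind of data-generating process the Toy Models setup assumes: many independent features, each active only rarely. The particular numbers are arbitrary illustrations, not values from the paper.

```python
# Sketch of "sparse features" in the Toy Models sense: many independent
# features, each nonzero only with small probability. Numbers are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 10_000, 100
p_active = 0.03                                    # each feature is active ~3% of the time

active = rng.random((n_samples, n_features)) < p_active
magnitudes = rng.random((n_samples, n_features))   # uniform magnitude when active
features = active * magnitudes                     # mostly-zero feature vectors

print("mean fraction of active features per sample:", active.mean())
```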
It might be hard to scale this to large multilayer models too. The toy model was a single-layer model, and the sparse autoencoder was already quite big; iirc the latent space was 8 times the size of the residual stream. Imagine trying to interpret GPT-4 with an autoencoder that big, while needing to do it over most layers; that looks intractable.
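As a back-of-envelope illustration of why that’s daunting, here’s the parameter count for one such autoencoder per layer. The width and layer count below are guesses (GPT-4’s architecture isn’t public), and the 8x expansion factor is just the one mentioned above.

```python
# Back-of-envelope cost of dictionary learning at scale. The residual
# stream width and layer count are guesses (GPT-4's architecture isn't
# public); the 8x expansion factor is the one mentioned above.
d_model = 12_288          # hypothetical residual stream width
n_layers = 96             # hypothetical number of layers
expansion = 8             # dictionary size = 8 * d_model

params_per_sae = 2 * d_model * (expansion * d_model)   # encoder + decoder weights
total_params = n_layers * params_per_sae

print(f"params per autoencoder: {params_per_sae / 1e9:.1f}B")
print(f"total across layers:    {total_params / 1e9:.1f}B")
```

Even with these guessed numbers you end up around a couple billion parameters per layer and a couple hundred billion across the whole model, which is where the scaling concern bites.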
Maybe they can introduce more efficient ways to un-superposition the features, but it doesn’t look trivial.