It might be hard to scale this to large multilayer models too. The toy model was a single-layer model, and the sparse autoencoder was already quite big; iirc the latent space was 8 times the size of the residual stream. Imagine trying to interpret GPT-4 with autoencoders that big, and needing to do it over most layers. It looks intractable.
Maybe they can introduce more efficient ways to pull features out of superposition, but it doesn't look trivial.
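To make the scaling worry concrete, here's a rough sketch (in PyTorch, with made-up dimensions, not GPT-4's actual sizes) of how the parameter count of a per-layer sparse autoencoder grows with the expansion factor and the number of layers:

```python
# Minimal sparse-autoencoder sketch, only to illustrate parameter scaling.
# All dimensions below are hypothetical, not GPT-4's real sizes.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, expansion: int = 8):
        super().__init__()
        d_latent = expansion * d_model          # latent space 8x the residual stream
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x):
        # ReLU gives a non-negative code; training would add an L1 penalty
        # on `latent` to push it toward sparsity.
        latent = torch.relu(self.encoder(x))
        return self.decoder(latent), latent

# Hypothetical: a 10,000-dim residual stream, expansion factor 8,
# and one autoencoder per layer for ~100 layers.
d_model, expansion, n_layers = 10_000, 8, 100
sae = SparseAutoencoder(d_model, expansion)
params_per_sae = sum(p.numel() for p in sae.parameters())
print(f"params per SAE: {params_per_sae:,}")                     # ~1.6B per layer
print(f"params across all layers: {params_per_sae * n_layers:,}")  # ~160B total
```

With those (invented) numbers, the autoencoders alone add up to roughly the size of a frontier model, which is the point: the cost scales with both the expansion factor and the layer count.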