Arthur Conmy comments on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small

Arthur Conmy 2 Feb 2024 14:52 UTC
4 points
Neel and I recently tried to interpret a language model circuit by attaching SAEs to the model. We found that using an L0=50 SAE and keeping only the top 10 features by activation value per prompt (zero-ablating the rest) did better than an L0=10 SAE, both on our task-specific metric and on subjective interpretability. I can check how far this generalizes.
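To make the top-10 filtering concrete, here is a minimal sketch of the idea, not the code we actually ran. It assumes each prompt has already been reduced to a single vector of SAE feature activations (for example, at one token position); the names `keep_top_k` and `l0` are hypothetical helpers made up for illustration.

```python
from typing import List


def keep_top_k(acts: List[float], k: int = 10) -> List[float]:
    """Zero-ablate all but the k largest activations in one prompt's feature vector.

    Assumption: `acts` holds one SAE feature activation per feature for a
    single prompt. Ties are broken in favour of the lower feature index.
    """
    if k <= 0:
        return [0.0] * len(acts)
    # Indices sorted by activation, largest first (sorted() is stable).
    top = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k]
    keep = set(top)
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]


def l0(acts: List[float]) -> int:
    """Number of non-zero feature activations (the L0 'norm')."""
    return sum(1 for a in acts if a != 0.0)


# Toy example: 6 features, keep only the 2 most active.
toy_acts = [0.0, 3.2, 0.5, 1.1, 0.0, 2.4]
filtered = keep_top_k(toy_acts, k=2)
```

In the setup described above, the filtered vector would then go through the SAE decoder in place of the full activation vector, so an SAE trained at L0=50 contributes at most 10 active features per prompt.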