Arthur Conmy comments on Improving Dictionary Learning with Gated Sparse Autoencoders

Arthur Conmy 1 May 2024 1:06 UTC
Ah yeah, Neel’s comment makes no claims about feature death beyond Pythia-2.8B residual streams. I trained 524K-width Pythia-2.8B MLP SAEs with <5% feature death (not in the paper), and Anthropic’s work gets to >1M live features (with no claims about interpretability). Together, these would make me surprised if 131K were near the maximum possible number of live features, even in small models.
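
For concreteness, here is a rough sketch of how a feature-death fraction like the "<5%" figure could be computed. It assumes a feature counts as dead if it never activates above a threshold on a sample of tokens; that definition, the function name `dead_feature_fraction`, and the toy data are illustrative assumptions, not the exact procedure used for the SAEs mentioned above.

```python
from typing import Sequence


def dead_feature_fraction(
    activations: Sequence[Sequence[float]], threshold: float = 0.0
) -> float:
    """Return the fraction of features that never exceed `threshold`.

    `activations` is a tokens-by-features matrix of SAE feature activations.
    Assumption (not from the original comment): a feature is "dead" if it
    never fires on any token in the sample.
    """
    if not activations:
        raise ValueError("need at least one token of activations")
    n_features = len(activations[0])
    fired = [False] * n_features
    for row in activations:
        for j, a in enumerate(row):
            if a > threshold:
                fired[j] = True
    # Features that never fired are counted as dead.
    return fired.count(False) / n_features


# Toy data: 3 tokens, 4 features; only feature index 2 never fires.
acts = [
    [0.0, 1.2, 0.0, 0.3],
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.0],
]
frac = dead_feature_fraction(acts)
print(f"dead fraction: {frac:.2f}")
```

In practice the measured fraction depends on how many tokens are sampled and on the threshold, so figures from different setups are only loosely comparable.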