Neel Nanda comments on Improving Dictionary Learning with Gated Sparse Autoencoders

Neel Nanda 26 Apr 2024 2:48 UTC
LW: 4 AF: 3
Re dictionary width: 2**17 (~131K) for most Gated SAEs and 3*(2**16) for baseline SAEs. The exception is the (Pythia-2.8B, Residual Stream) sites, where we used 2**15 for Gated and 3*(2**14) for baseline, since early runs of these had lots of feature death. (This’ll be added to the paper soon, sorry!) I’ll leave the other Qs for my co-authors.
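For readers who want the widths as plain numbers, here is a small sketch that just evaluates the expressions quoted in the comment. The dictionary keys and variable names are illustrative labels only, not identifiers from the paper or its code.

```python
# Dictionary widths quoted in the comment, evaluated explicitly.
# The site/kind labels are illustrative, not names from the paper's code.
widths = {
    ("most sites", "gated"): 2**17,
    ("most sites", "baseline"): 3 * 2**16,
    ("Pythia-2.8B residual", "gated"): 2**15,
    ("Pythia-2.8B residual", "baseline"): 3 * 2**14,
}

for (site, kind), width in widths.items():
    print(f"{site:>22} {kind:>8}: {width:>7,}")

# Ratio of baseline width to Gated width at each group of sites.
ratios = {
    site: widths[(site, "baseline")] / widths[(site, "gated")]
    for site in ("most sites", "Pythia-2.8B residual")
}
print(ratios)
```

In both cases the baseline dictionary comes out 1.5 times as wide as the Gated one (196,608 vs 131,072, and 49,152 vs 32,768).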
- leogao 1 May 2024 0:01 UTC
  LW: 2 AF: 1
  Got it—do you think with a bit more tuning the feature death at larger scale could be eliminated, or would it be tough to manage with the reinitialization approach?
  - Arthur Conmy 1 May 2024 0:49 UTC
    LW: 2 AF: 1
    I’m not sure what you mean by “the reinitialization approach” but feature death doesn’t seem to be a major issue at the moment. At all sites besides L27, our Gemma-7B SAEs didn’t have much feature death at all (stats at https://arxiv.org/pdf/2404.16014v2 up in a few hours), and also the Anthropic update suggests even in small models the problem can be addressed.
    - leogao 1 May 2024 1:01 UTC
      LW: 2 AF: 1
      Sorry, I meant the Anthropic-like neuron resampling procedure.
      
      I think I misread Neel’s comment: I thought he was saying that 131K was chosen because larger autoencoders would have too many dead latents (as opposed to this only applying to the Pythia residual stream sites).
      - Arthur Conmy 1 May 2024 1:06 UTC
        LW: 2 AF: 1
        Ah yeah, Neel’s comment makes no claims about feature death beyond the Pythia-2.8B residual stream sites. I trained 524K-width Pythia-2.8B MLP SAEs with <5% feature death (not in the paper), and Anthropic’s work gets to >1M live features (with no claims about interpretability). Together, these would make me surprised if 131K were near the maximum number of live features achievable, even in small models.
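To make a figure like "<5% feature death" concrete, here is a minimal sketch of how a dead-feature fraction could be computed. It assumes a latent counts as dead if it never activates above a threshold over some sample of tokens; that definition, the function name `dead_fraction`, and the toy data are all assumptions for illustration, not the procedure used in the paper.

```python
def dead_fraction(activations, threshold=0.0):
    """Fraction of latents that never exceed `threshold` on any token.

    `activations` is a list of rows, one per token, each row holding the
    activation of every latent on that token. (Hypothetical helper; the
    paper's exact criterion for a dead feature may differ.)
    """
    n_latents = len(activations[0])
    fired = [False] * n_latents
    for row in activations:
        for j, value in enumerate(row):
            if value > threshold:
                fired[j] = True
    return fired.count(False) / n_latents


# Toy example: 3 tokens, 4 latents; only the third latent never fires.
toy = [
    [0.0, 1.2, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.5],
]
print(dead_fraction(toy))  # 0.25
```

In practice the result depends heavily on how many tokens are sampled, since a rarely firing latent can look dead on a small sample.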