Johnny Lin comments on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small

Johnny Lin 4 Feb 2024 19:48 UTC
LW: 18 AF: 9
Hey Joseph (and coauthors),
Your directions are really fantastic. I hope you don’t mind, but I generated the activation data for the first 3000+ directions for each of the 12 layers and uploaded your directions to Neuronpedia:
https://www.neuronpedia.org/gpt2-small/res-jb
Your directions are also linked on the home page and the model page.
They’re also accessible by layer (sorted by top activation), e.g. layer 6: https://neuronpedia.org/gpt2-small/6-res-jb
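As a minimal sketch, the per-layer links could be generated like this. It assumes every layer uses the same URL pattern as the layer 6 link and that the 12 layers are numbered 0 to 11; both are assumptions, and `layer_url` is a hypothetical helper, not part of any Neuronpedia API.

```python
# Assumption: all layers follow the pattern of the layer 6 link,
# https://neuronpedia.org/gpt2-small/6-res-jb, with layers numbered 0..11.
BASE_URL = "https://neuronpedia.org/gpt2-small"
SOURCE_SET = "res-jb"  # RES-JB = Residuals by Joseph Bloom
NUM_LAYERS = 12


def layer_url(layer: int) -> str:
    """Build the (assumed) per-layer page URL for a given layer index."""
    if not 0 <= layer < NUM_LAYERS:
        raise ValueError(f"layer must be in [0, {NUM_LAYERS - 1}], got {layer}")
    return f"{BASE_URL}/{layer}-{SOURCE_SET}"


# One link per layer, in layer order.
layer_urls = [layer_url(i) for i in range(NUM_LAYERS)]

if __name__ == "__main__":
    for url in layer_urls:
        print(url)
```

Only the layer 6 link is confirmed here, so the other generated links should be checked before relying on them.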
I added the “Anthropic dashboard” to Neuronpedia for your dataset.
Explanations, comments, and autointerp scoring are also working—anyone can do this:
- Click a direction and submit an explanation at the top left. Here’s another Star Wars direction (5-RES-JB:1681) where GPT-4 gave me a score of 96.
  - Click the score for the scoring details.
I plan to do some autointerp explaining on a batch of these directions too.
Btw, your directions are so good that it’s easy to find super interesting stuff. For example, 5-RES-JB:5 is about astronomy.
I’m aware that you’re going to do some library updates to get even better directions, and I’m excited for that—will re-generate/upload all layers after the new changes come in.
Things that I’m still working on and hope to get working in the next few days:
- Making activation testing work for each neuron
- “Search / test” the same way that we have search/test for OpenAI’s directions
Again, your directions look fantastic—congrats. I hope this is useful/interesting for you and anyone trying to browse/explain them. Also, I didn’t know how to provide a citation/reference to you (and your team?) so I just used RES-JB = Residuals by Joseph Bloom and included links to all relevant sources on your directions page.
If there’s anything you’d like me to modify about this, or any feature you’d like me to add to make it better, please do not hesitate to let me know.
- Neel Nanda 6 Feb 2024 6:13 UTC
  10 points
  Thanks for doing this, I’m excited about Neuronpedia focusing on SAE features! I expect this to go much better than neuron interpretability.
- Joseph Bloom 7 Feb 2024 16:26 UTC
  1 point
  Agreed, thanks so much! Super excited about what can be done here!
  - eggsyntax 28 Feb 2024 21:53 UTC
    5 points
    Definitely really exciting! I’d suggest adding a mention of (and link to) Neuronpedia early on in this article for future readers.
    - Joseph Bloom 28 Feb 2024 23:40 UTC
      2 points
      Added a link at the top.