Your directions are really fantastic. I hope you don’t mind, but I generated the activation data for the first 3000+ directions for each of the 12 layers and uploaded your directions to Neuronpedia:
I added the “Anthropic dashboard” to Neuronpedia for your dataset.
Explanations, comments, and autointerp scoring are also working—anyone can do this:
Click a direction and submit explanation on the top-left. Here’s another Star Wars direction (5-RES-JB:1681) where GPT4 gave me a score of 96:
Click the score for the scoring details:
I plan to do some autointerp explaining on a batch of these directions too.
Btw—your directions are so good that it’s easy to find super interesting stuff. 5-RES-JB:5 is about astronomy:
I’m aware that you’re going to do some library updates to get even better directions, and I’m excited for that—will re-generate/upload all layers after the new changes come in.
Things that I’m still working on and hope to get working in the next few days:
Making activation testing work for each neuron
“Search / test” the same way that we have search/test for OpenAI’s directions
Again, your directions look fantastic—congrats. I hope this is useful/interesting for you and anyone trying to browse/explain them. Also, I didn’t know how to provide a citation/reference to you (and your team?) so I just used RES-JB = Residuals by Joseph Bloom and included links to all relevant sources on your directions page.
If there’s anything you’d like me to modify about this, or any feature you’d like me to add to make it better, please do not hesitate to let me know.
Hey Joseph (and coauthors),
Your directions are really fantastic. I hope you don’t mind, but I generated the activation data for the first 3000+ directions for each of the 12 layers and uploaded your directions to Neuronpedia:
https://www.neuronpedia.org/gpt2-small/res-jb
Your directions are also linked on the home page and the model page.
They’re also accessible by layer (sorted by top activation), eg layer 6: https://neuronpedia.org/gpt2-small/6-res-jb
I added the “Anthropic dashboard” to Neuronpedia for your dataset.
Explanations, comments, and autointerp scoring are also working—anyone can do this:
Click a direction and submit explanation on the top-left. Here’s another Star Wars direction (5-RES-JB:1681) where GPT4 gave me a score of 96:
Click the score for the scoring details:
I plan to do some autointerp explaining on a batch of these directions too.
Btw—your directions are so good that it’s easy to find super interesting stuff. 5-RES-JB:5 is about astronomy:
I’m aware that you’re going to do some library updates to get even better directions, and I’m excited for that—will re-generate/upload all layers after the new changes come in.
Things that I’m still working on and hope to get working in the next few days:
Making activation testing work for each neuron
“Search / test” the same way that we have search/test for OpenAI’s directions
Again, your directions look fantastic—congrats. I hope this is useful/interesting for you and anyone trying to browse/explain them. Also, I didn’t know how to provide a citation/reference to you (and your team?) so I just used RES-JB = Residuals by Joseph Bloom and included links to all relevant sources on your directions page.
If there’s anything you’d like me to modify about this, or any feature you’d like me to add to make it better, please do not hesitate to let me know.
Thanks for doing this, I’m excited about Neuronpedia focusing on SAE features! I expect this to go much better than neuron interpretability
Agreed, thanks so much! Super excited about what can be done here!
Definitely really exciting! I’d suggest adding a mention of (& link to) the Neuronpedia early on in this article for future readers.
added a link at the top.