Johnny Lin

Karma: 466

working on neuronpedia

Johnny Lin Jan 15, 2025, 8:14 PM
4 points
in reply to: Matthew Khoriaty’s comment on: Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Apologies for the issue with the Neuronpedia link. It's now been resolved.

Johnny Lin May 1, 2024, 8:18 PM
18 points
on: Transcoders enable fine-grained interpretable circuit analysis for language models
Hey Jacob + Philippe,
Hope you all don’t mind, but we put layer 8 of your transcoders up on Neuronpedia, with ~22k dashboards here:
https://neuronpedia.org/gpt2-small/8-tres-dc
Each dashboard can be accessed at its own URL:
https://neuronpedia.org/gpt2-small/8-tres-dc/0 goes to feature index 0.
You can also test each feature with custom text:
Or search all features at: https://www.neuronpedia.org/gpt2-small/tres-dc
An example search: https://www.neuronpedia.org/gpt2-small/?sourceSet=tres-dc&selectedLayers=[]&sortIndexes=[]&q=the%20cat%20sat%20on%20the%20mat%20at%20MATS
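The URL layout in the links above can be sketched as a few string helpers. This is a minimal illustration reconstructed only from the example links, not an official Neuronpedia API; the helper names are hypothetical, and it assumes the `www.` host serves the same paths as the bare domain (the links above use both).

```python
from urllib.parse import quote

# Assumption: www.neuronpedia.org and neuronpedia.org serve the same paths.
BASE = "https://www.neuronpedia.org"


def source_set_url(model: str, source_set: str) -> str:
    """Page for searching every feature in a source set, e.g. gpt2-small/tres-dc."""
    return f"{BASE}/{model}/{source_set}"


def layer_url(model: str, layer: int, source_set: str) -> str:
    """All features of one layer, e.g. gpt2-small/8-tres-dc."""
    return f"{BASE}/{model}/{layer}-{source_set}"


def feature_url(model: str, layer: int, source_set: str, index: int) -> str:
    """Dashboard for a single feature index, e.g. gpt2-small/8-tres-dc/0."""
    return f"{layer_url(model, layer, source_set)}/{index}"


def search_url(model: str, source_set: str, query: str) -> str:
    """Text search mirroring the example link: no layer/sort filters, URL-encoded query."""
    return (
        f"{BASE}/{model}/?sourceSet={source_set}"
        f"&selectedLayers=[]&sortIndexes=[]&q={quote(query)}"
    )


print(feature_url("gpt2-small", 8, "tres-dc", 0))
print(search_url("gpt2-small", "tres-dc", "the cat sat on the mat at MATS"))
```

`quote` encodes spaces as `%20`, which reproduces the example search link above character for character.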
Unfortunately, I wasn’t able to generate histograms, autointerp, or other layers for this yet. I’m working on getting more layers up first.
Verification
I did spot checks of the first few dashboards and they seem to be correct. Please let me know if anything seems wrong or off. I am also happy to delete this comment if you do not find it useful or for any other reason—no worries.
Please let me know if you have any feedback or issues with this. I will also be reaching out directly via Slack.

Johnny Lin Mar 31, 2024, 6:45 PM
4 points
on: SAE-VIS: Announcement Post
Thanks Callum and yep we’ve been extensively using SAE-Vis at Neuronpedia—it’s been extremely helpful for generating dashboards and it’s very well maintained. We’ll have a method of directly importing to Neuronpedia using the exports from SAE-Vis coming out soon.

Johnny Lin Feb 4, 2024, 7:48 PM
LW: 18 AF: 9
on: Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small
Hey Joseph (and coauthors),
Your directions are really fantastic. I hope you don’t mind, but I generated the activation data for the first 3000+ directions for each of the 12 layers and uploaded your directions to Neuronpedia:
https://www.neuronpedia.org/gpt2-small/res-jb
Your directions are also linked on the home page and the model page.
They’re also accessible by layer (sorted by top activation), eg layer 6: https://neuronpedia.org/gpt2-small/6-res-jb
I added the “Anthropic dashboard” to Neuronpedia for your dataset.
Explanations, comments, and autointerp scoring are also working—anyone can do this:
- Click a direction and submit an explanation at the top left. Here’s another Star Wars direction (5-RES-JB:1681) where GPT-4 gave me a score of 96:
  - Click the score for the scoring details:
I plan to do some autointerp explaining on a batch of these directions too.
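The feature IDs used in this comment (e.g. 5-RES-JB:1681) read as LAYER-SOURCESET:INDEX. Below is a hypothetical parser that turns such an ID into a dashboard link. It assumes res-jb dashboards follow the same /gpt2-small/LAYER-SOURCESET/INDEX pattern as the transcoder links and the lowercase 6-res-jb layer URL; the function names are illustrative only.

```python
import re

# Hypothetical pattern for IDs like "5-RES-JB:1681": layer, source set, feature index.
FEATURE_ID = re.compile(r"^(\d+)-([A-Za-z0-9-]+):(\d+)$")


def parse_feature_id(feature_id: str) -> tuple[int, str, int]:
    """Split 'LAYER-SOURCESET:INDEX' into (layer, lowercase source set, index)."""
    match = FEATURE_ID.match(feature_id.strip())
    if match is None:
        raise ValueError(f"expected LAYER-SOURCESET:INDEX, got {feature_id!r}")
    layer, source_set, index = match.groups()
    return int(layer), source_set.lower(), int(index)


def dashboard_url(feature_id: str, model: str = "gpt2-small") -> str:
    """Assumed URL layout: /MODEL/LAYER-SOURCESET/INDEX with a lowercase source set."""
    layer, source_set, index = parse_feature_id(feature_id)
    return f"https://neuronpedia.org/{model}/{layer}-{source_set}/{index}"


print(dashboard_url("5-RES-JB:1681"))  # the Star Wars direction
print(dashboard_url("5-RES-JB:5"))     # the astronomy direction
```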
Btw—your directions are so good that it’s easy to find super interesting stuff. 5-RES-JB:5 is about astronomy:
I’m aware that you’re going to do some library updates to get even better directions, and I’m excited for that—will re-generate/upload all layers after the new changes come in.
Things that I’m still working on and hope to get working in the next few days:
- Making activation testing work for each neuron
- “Search / test” the same way that we have search/test for OpenAI’s directions
Again, your directions look fantastic—congrats. I hope this is useful/interesting for you and anyone trying to browse/explain them. Also, I didn’t know how to provide a citation/reference to you (and your team?) so I just used RES-JB = Residuals by Joseph Bloom and included links to all relevant sources on your directions page.
If there’s anything you’d like me to modify about this, or any feature you’d like me to add to make it better, please do not hesitate to let me know.

Johnny Lin Jan 31, 2024, 6:34 PM
4 points
on: Exploring OpenAI’s Latent Directions: Tests, Observations, and Poking Around
Apparently one or more anonymous users got really excited and ran a bunch of simultaneous searches while I was sleeping, triggering an open tokenizer bug and causing our TransformerLens server to hang/crash. This caused some downtime.
A workaround has been implemented and pushed.

Johnny Lin Jan 31, 2024, 6:31 PM
2 points
in reply to: mishka’s comment on: Exploring OpenAI’s Latent Directions: Tests, Observations, and Poking Around
Thanks for the tip! I’ve added the link under “Exploration Tools” after the first mention of Neuronpedia. Let me know if that is the proper way to do it—I couldn’t find a feature on LW for a special “context link” if there is such a feature.

Johnny Lin Oct 1, 2023, 5:44 PM
4 points
on: New Tool: the Residual Stream Viewer
Great work Adam, especially on the customizability. It’s fascinating clicking through various types and indexes to look for patterns, and I’m looking forward to using this to find interesting directions.

Johnny Lin Aug 9, 2023, 5:07 AM
2 points
in reply to: Nathaniel Monson’s comment on: Neuronpedia—AI Safety Game
Hey Nathan, so sorry this took so long. Finally shipped this—you can now toggle “Profanity/Explicit” OFF in “Edit Profile”. Some notes about the implementation:
- If the filter is enabled, hides activation texts that contain a bad word (the neuron itself is still displayed)
- Works by checking a list of bad words
- The filter is disabled by default (profanity shown) in order to get more accurate explanations
- Asks user during onboarding for their preference
It turns out that nearly all neurons have some explicit (or explicit-looking) text, so it’s not feasible to automatically skip all of these neurons; we would end up skipping everything. So we only hide the individual activation texts that are explicit.
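The filtering rule described above can be sketched as follows. This is a minimal illustration, not Neuronpedia’s code: `BAD_WORDS` holds placeholder entries, `Neuron` and `visible_texts` are hypothetical names, and matching on lowercase whole words is an assumption.

```python
import re
from dataclasses import dataclass, field

# Placeholder entries; the actual bad-word list is not given in the comment.
BAD_WORDS = frozenset({"badword", "worseword"})


@dataclass
class Neuron:
    layer: int
    index: int
    activation_texts: list[str] = field(default_factory=list)


def is_explicit(text: str, bad_words: frozenset[str] = BAD_WORDS) -> bool:
    """True if any lowercase whole word in the text is on the bad-word list."""
    return any(word in bad_words for word in re.findall(r"[a-z]+", text.lower()))


def visible_texts(neuron: Neuron, hide_explicit: bool = False) -> list[str]:
    """Texts to show for a neuron. The neuron is never skipped; only explicit texts are hidden.

    hide_explicit defaults to False (profanity shown), matching the default described above.
    """
    if not hide_explicit:
        return list(neuron.activation_texts)
    return [t for t in neuron.activation_texts if not is_explicit(t)]


n = Neuron(layer=0, index=1, activation_texts=["a clean line", "has badword here", "BadWord!"])
print(visible_texts(n))                      # all three texts
print(visible_texts(n, hide_explicit=True))  # ['a clean line']
```

Filtering per text rather than per neuron is what keeps the game playable when almost every neuron has at least one explicit-looking example.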
Sorry again and thanks for the feedback!

Johnny Lin Aug 7, 2023, 6:30 PM
1 point
in reply to: Zian’s comment on: Neuronpedia—AI Safety Game
Sorry about that! Should have fixed that way earlier. I’ve transferred the app to “Neuronpedia”, so it should appear correctly now. Thank you for flagging this.

Johnny Lin Jul 29, 2023, 6:09 PM
4 points
in reply to: kuira’s comment on: Neuronpedia—AI Safety Game
Yes, this is a great idea. I think something other than “skip” is needed, since skipping doesn’t declare “this neuron has too many meanings or doesn’t seem to do anything”, which is actually useful information.

Johnny Lin Jul 29, 2023, 8:52 AM
3 points
in reply to: Adele Lopez’s comment on: Neuronpedia—AI Safety Game
Re: polysemanticity, I have a big tweak to the game coming up that may help with this! I hope to get it out by early next week.

Johnny Lin Jul 29, 2023, 8:44 AM
3 points
in reply to: Adele Lopez’s comment on: Neuronpedia—AI Safety Game
Lol, thanks. I can’t believe the link has been broken for so long on the site. It should be fixed in a few seconds from now. In the meantime, if you’re interested: https://discord.gg/kpEJWgvdAx

Johnny Lin Jul 28, 2023, 7:00 PM
3 points
in reply to: Brendan Long’s comment on: Neuronpedia—AI Safety Game
Thanks Brendan—yes—Apple/Google/Email login is coming before public non-beta release. May also add Facebook.

Johnny Lin Jul 28, 2023, 5:57 PM
2 points
in reply to: sudhanshu_kasewa’s comment on: Neuronpedia—AI Safety Game
I love these variations on the game. Yes, the idea is to build a scalable backend and then we can build various different games on top of it. Initially there was no game and it was just browsing random neurons manually! Then the game was built on top of it. Internally I call the current Neuronpedia game “Vote Mode”.
Would love to have you on our Discord even if just to lurk and occasionally pitch an idea. Thanks for playing!

Johnny Lin Jul 27, 2023, 8:28 PM
2 points
in reply to: Caridorc Tergilti’s comment on: Neuronpedia—AI Safety Game
Thank you! I will put this on the TODO.

Johnny Lin Jul 27, 2023, 5:48 PM
3 points
in reply to: Caridorc Tergilti’s comment on: Neuronpedia—AI Safety Game
Thank you—yes, this is on the TODO along with Apple Sign-In. It will likely not be difficult, but it’s in beta experiment phase right now for feedback and fixes—we aren’t yet ready for the scale of the general public.

Johnny Lin Jul 27, 2023, 5:47 PM
2 points
in reply to: Hoagy’s comment on: Neuronpedia—AI Safety Game
Thank you Hoagy. Expanding beyond the neuron unit is a high priority. I’d like to work with you, Logan Riggs, and others to figure out a good way to make this happen in the next major update so that people can easily view, test, and contribute. I’m now creating a new channel on the discord (#directions) to discuss this: https://discord.gg/kpEJWgvdAx, or I’ll DM you my email if you prefer that.

Johnny Lin Jul 27, 2023, 5:33 PM
3 points
in reply to: Logan Riggs’s comment on: Neuronpedia—AI Safety Game
Hi Logan, thanks for your response. Your dictionaries post is on the TODO to investigate and integrate (someone had referred me to it two weeks ago); I’d love to make it happen.
Thanks for joining the Discord. Let’s discuss when I have a few days to get caught up with your work.

Johnny Lin Jul 27, 2023, 4:59 PM
2 points
in reply to: Neel Nanda’s comment on: Neuronpedia—AI Safety Game
Hi Neel, thanks for playing and thanks for all your incredible work. Neuronpedia uses a ton of your stuff.
Re: polysemantic neurons—yes, I should address this before wider distribution. Some current ideas—if you have a preference please let me know.
1. Your proposed “this is a mess” button
2. Allow voting on more than one option at a time (users can do multiple votes for explanations per neuron on the neuron’s page, but the game automatically moves on to a new neuron after one vote to keep it more “game-like”)
3. Encourage “or” explanations: “cat or tomato or purple”
4. Add a warning for users
Re: the OpenWebText tokenizer bug, thank you! This will make the text more legible. I’ll make the display change and add a footnote.
EDIT: the double unknown chars should now show up as one apostrophe. AFAIK, there is no way to tell the difference between a single-quote unknown char and a double-quote unknown char, but it’s probably OK to just show it as a single quote for now.
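The display fix in the EDIT can be sketched as a one-line substitution. This assumes the “unknown char” is the Unicode replacement character U+FFFD (what Python’s `bytes.decode(..., errors="replace")` emits for bytes it cannot decode); the function name is hypothetical.

```python
REPLACEMENT_CHAR = "\ufffd"  # assumed form of the "unknown char"


def merge_unknown_quotes(text: str) -> str:
    """Show each pair of unknown chars as a single apostrophe.

    Single and double curly quotes can't be told apart here, so both become "'".
    """
    return text.replace(REPLACEMENT_CHAR * 2, "'")


print(merge_unknown_quotes("don\ufffd\ufffdt"))  # don't
```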

Johnny Lin Jul 27, 2023, 4:52 PM
3 points
in reply to: Pazzaz’s comment on: Neuronpedia—AI Safety Game
Sorry—New Discord link (changed to a “Community Server”) https://discord.gg/kpEJWgvdAx