Neel Nanda comments on Neuronpedia

Neel Nanda 27 Jul 2023 11:49 UTC
9 points
Cool concept! Thanks for making it. And that’s a lovely looking website, especially for just three weeks!
The core problem with this kind of thing is that neurons are often not actually monosemantic: models use significant superposition, so a single neuron can mean many different things. This is a pretty insurmountable problem. I don’t think it sinks the concept of the website, but it seems valuable to e.g. have a “this seems like a polysemantic mess” button.
Bug report: in OWT, apostrophes and quote marks are often tokenized as two separate tokens because of a dumb bug in the tokenizer (they’re a weird Unicode character that it doesn’t recognise, so the character’s bytes get split across two tokens). This looks confusing, eg here: (the gap between the name and s is an apostrophe). It’s unclear how best to deal with this; my recommendation is to display an empty string and then an apostrophe/quotation mark, with a footnote on hover explaining it.
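To make the recommendation concrete, here is a minimal sketch of the "empty string, then apostrophe" display. It assumes the raw bytes of each token are available and that the curly apostrophe (U+2019, three bytes in UTF-8) is split two bytes plus one byte across the tokens; both are assumptions for illustration, and `render_tokens` is a hypothetical helper, not part of Neuronpedia or any tokenizer library.

```python
import codecs


def render_tokens(token_bytes: list[bytes]) -> list[str]:
    """Return one display string per token.

    A token that ends partway through a UTF-8 character renders as "",
    and the token that completes the character renders the whole character.
    """
    # The incremental decoder buffers incomplete byte sequences between calls.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    return [decoder.decode(tok) for tok in token_bytes]


# Hypothetical token stream for "Neel’s"; the 2+1 byte split is assumed.
tokens = [b"Neel", b"\xe2\x80", b"\x99", b"s"]

# Decoding each token on its own yields replacement characters instead.
naive = [tok.decode("utf-8", errors="replace") for tok in tokens]

rendered = render_tokens(tokens)
```

Because each token still gets its own display cell, per-token activations stay aligned with the text. Bytes left incomplete at the very end of the stream are silently dropped in this sketch.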
- Johnny Lin 27 Jul 2023 16:59 UTC
  2 points
  Parent
  Hi Neel, thanks for playing and thanks for all your incredible work. Neuronpedia uses a ton of your stuff.
  Re: polysemantic neurons: yes, I should address this before wider distribution. Some current ideas (if you have a preference, please let me know):
  1. Your proposed “this is a mess” button
  2. Allow voting on more than one option at a time (users can do multiple votes for explanations per neuron on the neuron’s page, but the game automatically moves on to a new neuron after one vote to keep it more “game-like”)
  3. Encourage “or” explanations: “cat or tomato or purple”
  4. Add a warning for users
  Re: Open Web Text tokenizer bug: thank you! This will make the text more legible. I’ll make the display change and footnote.
  EDIT: The double unknown chars should now show up as one apostrophe. AFAIK there is no way to tell the difference between a single-quote unknown char and a double-quote unknown char, but it’s probably OK to just show it as a single quote for now.
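The display change described in the EDIT could look roughly like the sketch below. It assumes each token has already been decoded to a string on its own, so each half of a split quote character arrives as the Unicode replacement character U+FFFD; `collapse_unknown_pairs` is a hypothetical name, not Neuronpedia's actual code. Since both halves look identical after decoding, the sketch cannot tell single quotes from double quotes and always shows an apostrophe.

```python
REPLACEMENT = "\ufffd"  # what an undecodable token fragment decodes to
APOSTROPHE = "'"        # display choice; the original character is unknown


def collapse_unknown_pairs(tokens: list[str]) -> list[str]:
    """Replace each pair of adjacent unknown-char tokens with one apostrophe."""
    out: list[str] = []
    i = 0
    while i < len(tokens):
        is_pair = (
            tokens[i] == REPLACEMENT
            and i + 1 < len(tokens)
            and tokens[i + 1] == REPLACEMENT
        )
        if is_pair:
            out.append(APOSTROPHE)
            i += 2  # skip both halves of the split character
        else:
            out.append(tokens[i])
            i += 1
    return out


collapsed = collapse_unknown_pairs(["Neel", REPLACEMENT, REPLACEMENT, "s"])
```

Merging two tokens into one changes the token count, so any per-token activations shown alongside the text would need a matching merge (for example, taking the larger of the two values); that choice is left open here.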