Very cool idea!
It looked like several of the text samples were from erotica or something, which...seems like something I don’t want to see without actively opting in—is there an easy way for you to filter those out?
Hi Nathan, thanks for playing and pointing out the issue. My apologies for the inappropriate text.
Half the text samples are from OpenWebText, which is scraped web data that GPT-2 was trained on. I don’t know the exact details, but I believe some of it came from Reddit and other sites.
If you DM me the neuron’s address next time you see one, I can start compiling a filter. I will also look for an open source library that categorizes text as safe or not safe.
My apologies again. This is a beta experiment; thanks for putting up with it while I fix the issues.
You can use an MPNet model from Sentence Transformers (e.g. “all-mpnet-base-v2”) for zero-shot classification: take the scalar product between the text embedding and the embeddings of “sex” and “violence”, decide a cut-off, and you are done.
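For concreteness, here is a minimal sketch of that suggestion using the sentence-transformers library. The “all-mpnet-base-v2” checkpoint, the label words, and the 0.3 cut-off are illustrative assumptions, not values from this thread; with normalized embeddings the cosine similarity used below equals the scalar product the commenter describes.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative labels and model choice; tune the labels and cut-off on real activation texts.
LABELS = ["sex", "violence"]
model = SentenceTransformer("all-mpnet-base-v2")
label_embeddings = model.encode(LABELS, normalize_embeddings=True)

def looks_explicit(text: str, cutoff: float = 0.3) -> bool:
    """Flag text whose embedding is close to any of the unsafe labels."""
    text_embedding = model.encode(text, normalize_embeddings=True)
    scores = util.cos_sim(text_embedding, label_embeddings)  # shape (1, len(LABELS))
    return bool(scores.max() >= cutoff)

# Example: drop flagged samples before displaying them.
samples = ["The cat sat on the mat.", "Some scraped web text."]
safe_samples = [s for s in samples if not looks_explicit(s)]
```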
Thank you! I will put this on the TODO list.
Hey Nathan, so sorry this took so long. Finally shipped this—you can now toggle “Profanity/Explicit” OFF in “Edit Profile”. Some notes about the implementation:
If enabled, hides activation texts that contain a bad word (the neuron itself is still displayed)
Works by checking each text against a list of bad words (see the sketch after these notes)
Default is disabled (profanity shown), since unfiltered text gives more accurate explanations
Asks users during onboarding for their preference
It turns out that nearly all neurons have some sort of explicit (or explicit-looking) text, so automatically skipping those neurons isn’t feasible: we’d end up skipping everything. Instead, we only hide the individual activation texts that are explicit.
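A rough sketch of what the behavior described above could look like; the word list, the tokenization, and the function names are made up for illustration and are not the site’s actual code.

```python
import re

# Placeholder word list; a real deployment would presumably use a much larger
# open source profanity list rather than these stand-ins.
BAD_WORDS = {"badword1", "badword2"}
WORD_RE = re.compile(r"[a-z']+")

def contains_bad_word(text: str) -> bool:
    """Return True if any token in the text matches the bad-word list."""
    return any(token in BAD_WORDS for token in WORD_RE.findall(text.lower()))

def visible_activation_texts(activation_texts: list[str], hide_explicit: bool) -> list[str]:
    """Hide only the offending activation texts; the neuron itself stays visible."""
    if not hide_explicit:  # default: filter off, all texts shown
        return activation_texts
    return [t for t in activation_texts if not contains_bad_word(t)]
```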
Sorry again and thanks for the feedback!