Very cool idea!
It looked like several of the text samples were from erotica or something, which...seems like something I don’t want to see without actively opting in—is there an easy way for you to filter those out?
Hi Nathan, thanks for playing and pointing out the issue. My apologies for the inappropriate text.
Half the text samples are from OpenWebText, which is scraped web data that GPT-2 was trained on. I don’t know the exact details, but I believe some of it came from Reddit and other sites.
If you DM me the neuron’s address next time you see one, I can start compiling a filter. I will also look for an open source library that categorizes text as safe or not safe.
My apologies again. This is a beta experiment; thanks for putting up with it while I fix the issues.
You can use an MPNet model from Sentence Transformers (e.g. “all-mpnet-base-v2”) for zero-shot classification: take the scalar product between the text embedding and the embeddings of “sex” and “violence”, decide a cut-off, and you are done.
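For concreteness, here is a minimal sketch of that suggestion using the sentence-transformers library. The “all-mpnet-base-v2” checkpoint, the label words, and the 0.3 cut-off are illustrative assumptions, not values from this thread; with normalized embeddings the cosine similarity used below equals the scalar product the commenter describes.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative labels and model choice; tune the labels and cut-off on real activation texts.
LABELS = ["sex", "violence"]
model = SentenceTransformer("all-mpnet-base-v2")
label_embeddings = model.encode(LABELS, normalize_embeddings=True)

def looks_explicit(text: str, cutoff: float = 0.3) -> bool:
    """Flag text whose embedding is close to any of the unsafe labels."""
    text_embedding = model.encode(text, normalize_embeddings=True)
    scores = util.cos_sim(text_embedding, label_embeddings)  # shape (1, len(LABELS))
    return bool(scores.max() >= cutoff)

# Example: drop flagged samples before displaying them.
samples = ["The cat sat on the mat.", "Some scraped web text."]
safe_samples = [s for s in samples if not looks_explicit(s)]
```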
Thank you! I will put this on the TODO list.
Hey Nathan, so sorry this took so long. Finally shipped this—you can now toggle “Profanity/Explicit” OFF in “Edit Profile”. Some notes about the implementation:
If enabled, hides activation texts that contain a bad word (the neuron itself is still displayed)
Works by checking each text against a list of bad words (see the sketch after these notes)
Default is disabled (profanity shown), since unfiltered text gives more accurate explanations
Asks users during onboarding for their preference
It turns out that nearly all neurons have some sort of explicit (or explicit-looking) text, so automatically skipping those neurons isn’t feasible: we’d end up skipping everything. Instead, we only hide the individual activation texts that are explicit.
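A rough sketch of what the behavior described above could look like; the word list, the tokenization, and the function names are made up for illustration and are not the site’s actual code.

```python
import re

# Placeholder word list; a real deployment would presumably use a much larger
# open source profanity list rather than these stand-ins.
BAD_WORDS = {"badword1", "badword2"}
WORD_RE = re.compile(r"[a-z']+")

def contains_bad_word(text: str) -> bool:
    """Return True if any token in the text matches the bad-word list."""
    return any(token in BAD_WORDS for token in WORD_RE.findall(text.lower()))

def visible_activation_texts(activation_texts: list[str], hide_explicit: bool) -> list[str]:
    """Hide only the offending activation texts; the neuron itself stays visible."""
    if not hide_explicit:  # default: filter off, all texts shown
        return activation_texts
    return [t for t in activation_texts if not contains_bad_word(t)]
```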
Sorry again and thanks for the feedback!