That’s probably exactly what’s going on. The usernames were so frequent in the reddit comments dataset that the tokenizer, the part that breaks a paragraph up into word-ish-sized chunks like “ test” or “ SolidGoldMagikarp” (the space is included in many tokens) so that the neural network doesn’t have to deal with each character, learned they were important words. But in a later stage of learning, comments without complex text were filtered out, resulting in your usernames getting their own words… but the neural network never seeing the words activate. It’s as if you had an extra eye facing the inside of your skull, and you’d never felt it activate, and then one day some researchers trying to understand your brain shined a bright light on your skin and the extra eye started sending you signals. Except, you’re a language model, so it’s more like each word is a separate finger, and you have tens of thousands of fingers, one on each word button. Uh, that got weird.
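For anyone who wants to poke at the mechanism, here is a rough toy sketch in Python. It is not how any real tokenizer or training pipeline is implemented: the byte-pair-encoding trainer is a simplified whitespace-split version (it ignores the leading-space detail mentioned above), the corpus is made up, and the `keep` filter is a hypothetical stand-in for whatever filtering actually happened. It only illustrates the two-stage story: a string that is frequent when the tokenizer is trained gets its own token, and if the filtered data never contains it, that token is never seen again.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(lines, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Every word starts out as a tuple of single characters.
    words = Counter(tuple(w) for line in lines for w in line.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges

def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Made-up data: many near-empty "counting" comments plus a little ordinary prose.
counting = [f"SolidGoldMagikarp {n}" for n in range(1000, 1050)]
prose = ["the gold fish swam past the old mill",
         "a solid plan needs more than one idea"] * 3

# Stage 1: the tokenizer is trained on everything, counting comments included.
merges = train_bpe(counting + prose, num_merges=40)
vocab = {a + b for a, b in merges}

# Stage 2: a hypothetical quality filter drops short comments before model training.
def keep(comment):
    return len(comment.split()) >= 4

model_corpus = [c for c in counting + prose if keep(c)]
seen = Counter(tok for c in model_corpus
               for w in c.split()
               for tok in encode(w, merges))

# Tokens that exist in the vocabulary but never occur in the model's training data.
never_seen = sorted(t for t in vocab if seen[t] == 0)

print(encode("SolidGoldMagikarp", merges))   # a single token
print("SolidGoldMagikarp" in never_seen)     # True
```

In a real model, a token with zero occurrences would get essentially no gradient updates to its embedding, which is one plausible reading of “never seeing the words activate”.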
Good find! Just spelling out the actual source of the dataset contamination for others since the other comments weren’t clear to me:
r/counting is a subreddit in which people ‘count to infinity by 1s’, and the leaderboard for this shows the number of times they’ve ‘counted’ in this subreddit. These users have made 10s to 100s of thousands of reddit comments of just a number. See threads like this:
https://old.reddit.com/r/counting/comments/ghg79v/3723k_counting_thread/
They’d be perfect candidates for exclusion from training data. I wonder how they’d feel to know they posted enough inane comments to cause bugs in LLMs.
Skeptical, apparently.
This is an incredible analogy
Once again, disturbed that humans writing nonsense on the internet is being fed to developing minds, which become understandably confused and buggy as a result. :( In the case of reddit here, at least it had meaning and function in context, but for a lot of human stuff online...
It’s part of why I am so worried about recent attempts by e.g. Meta to make an LLM that is simply bigger, and hence less curated, by scraping anything they can find online for it. Can you all imagine how fucked up an AI would act if you feed it 4chan as a model for human communication? :( This is not on AI, it is on us feeding it our worst and most irrational sides. :(
Imagine no longer