That’s probably exactly what’s going on. The usernames were so frequent in the Reddit comments dataset that the tokenizer, the part that breaks a paragraph up into word-ish-sized chunks like “ test” or “ SolidGoldMagikarp” (the space is included in many tokens) so that the neural network doesn’t have to deal with each character, learned they were important words. But in a later stage of learning, comments without complex text were filtered out, resulting in your usernames getting their own words… but the neural network never seeing the words activate. It’s as if you had an extra eye facing the inside of your skull, and you’d never felt it activate, and then one day some researchers trying to understand your brain shined a bright light on your skin and the extra eye started sending you signals. Except, you’re a language model, so it’s more like each word is a separate finger, and you have tens of thousands of fingers, one on each word button. Uh, that got weird.
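For anyone who wants to poke at the mechanism, here is a toy sketch in Python. It is not the real GPT tokenizer or data pipeline: the byte-pair-style merge loop is a simplified stand-in, “Countzilla” is a made-up username, and the “keep only comments containing letters” filter is my assumption about what “comments without complex text were filtered out” could look like. It only shows that a string can earn its own token in stage one and then never appear in the text the network is trained on in stage two.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Greedy BPE-style training: repeatedly merge the most frequent adjacent pair."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged = merge_pair(symbols, best)
            new_vocab[merged] = new_vocab.get(merged, 0) + freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def words_of(author, body):
    """Split a record into words, each carrying its leading space."""
    return [" " + w for w in f"{author} {body}".split()]


# Toy data (hypothetical): one prolific counter plus two ordinary comments.
records = [("Countzilla", str(n)) for n in range(1000, 1300)]
records += [
    ("alice", "the tokenizer splits text into chunks"),
    ("bob", "that is a good find"),
]

# Stage 1: the tokenizer is fit on everything, counting comments included.
word_freqs = Counter(w for a, b in records for w in words_of(a, b))
merges = learn_bpe(word_freqs, num_merges=15)

# Stage 2 (assumed filter): the network only sees comments whose body has letters.
kept = [(a, b) for a, b in records if any(ch.isalpha() for ch in b)]
train_tokens = [
    tok for a, b in kept for w in words_of(a, b) for tok in encode(w, merges)
]

username_tokens = encode(" Countzilla", merges)
seen_in_training = train_tokens.count(" Countzilla")
print(username_tokens, seen_in_training)  # [' Countzilla'] 0
```

The username ends up as a single token, leading space and all, yet it shows up zero times in the filtered text, so whatever embedding it got would never be updated by training. That is the “extra eye” in miniature.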
I think I found the root of some of the poisoning of the dataset at this link. It contains TheNitromeFan, SolidGoldMagikarp, RandomRedditorWithNo, Smartstocks, and Adinida from the original post, as well as many other usernames which induce similar behaviours; for example, when ChatGPT is asked about davidjl123, it either terminates responses early or misinterprets the input in a similar way to the other prompts. I don’t think it’s a backend scraping thing so much as a result of scraping GitHub, which in turn contains all sorts of unusual data.
Good find! Just spelling out the actual source of the dataset contamination for others since the other comments weren’t clear to me:
r/counting is a subreddit in which people ‘count to infinity by 1s’, and its leaderboard shows the number of times each user has ‘counted’ there. These users have made tens to hundreds of thousands of Reddit comments consisting of just a number. See threads like this:
https://old.reddit.com/r/counting/comments/ghg79v/3723k_counting_thread/
They’d be perfect candidates for exclusion from training data. I wonder how they’d feel to know they posted enough inane comments to cause bugs in LLMs.
Skeptical, apparently.
This is an incredible analogy
Once again, disturbed that humans writing nonsense on the internet is being fed to developing minds, which become understandably confused and buggy as a result. :( In the case of reddit here, at least it had meaning and function in context, but for a lot of human stuff online...
It’s part of why I am so worried about recent attempts by e.g. Meta to make an LLM that is simply bigger, and hence less curated, by scraping anything they can find online for it. Can you all imagine how fucked up an AI would act if you feed it 4chan as a model for human communication? :( This is not on AI, it is on us feeding it our worst and most irrational sides. :(
Imagine no longer
What is quite interesting about that dataset is that it has strings of the form “*number*|*weirdstring*|*number*”, which I remember seeing in some methods of training LLMs, i.e. “|” being used as a delimiter for tokens. They could be poisoned training examples or have some weird effect in retrieval.
This repository seems to contain the source code of a bot responsible for updating the “Hall of Counters” in the About section of the r/counting community on Reddit. I don’t participate in the community, but from what I can gather, this list seems to be a leaderboard for the community’s most active members. A number of these anomalous tokens still persist on the present-day version of the list.
I did do a little research around that community before posting my comment; only later did I realise that I’d actually discovered a failure mode distinct from those in the original post: under some circumstances, ChatGPT interprets the usernames as numbers. This could be because the /r/counting subreddit is a place where people make many posts incrementing integers. So these username tokens, if encountered in a Reddit-derived dataset, might be interpreted as numbers themselves, since they’d almost always be contextually surrounded by actual numbers.
FYI: my understanding is that “data poisoning” refers to deliberately corrupting the training data of somebody else’s model, which, as I understand it, is not what you are describing.
Sure, let’s say this is more like a poorly-labelled bottle of detergent that the model is ingesting under the impression that it’s cordial. A Tide Pod Challenge of unintended behaviours. I was just calling it “poisoning” as shorthand since the end result is the same: it’s kind of an accidental poisoning.