I think I addressed that specifically in my comment above. The behavior is explained by a sequence like this: there is a large amount of bot-spammed harassment material that goes into early GPT development; someone removes it, either from Reddit or just from the training data, not on the basis of it mentioning the targets but based on other characteristics (like being repetitive). Then the tokens are orphaned.
Many of the other strings in the list of triggers look like they may have been UI elements or other markup removed by improved data sanitization.
I know that Reddit has removed a very significant number of comments referencing me, since they’re gone when I look them up. I hope you would agree that it’s odd that the only two obviously human names in the list are people who know each other and have collaborated in the past.
There is a large amount of bot-spammed harassment material that goes into early GPT development; someone removes it, either from Reddit or just from the training data, not on the basis of it mentioning the targets but based on other characteristics (like being repetitive). Then the tokens are orphaned.
That’s a different narrative from what you were first describing:
someone noticed an in-development model spontaneously defaming us and then expressly filtered out material mentioning us from the training.
Your first narrative is unlikely for all the reasons I described: that an OAer bestirred themselves to special-case you & Todd, and only you and Todd, for an obscure throwaway research project en route to bigger & better things, to block behavior which manifests nowhere else and only hypothetically in the early outputs of a model whose outputs they by & large weren’t reading to begin with, nor doing much cleaning of.
Now, a second narrative is more plausible: the initial tokenization includes those names, and then the later webscrape they describe doing on the basis of Reddit external (submitted/outbound) links with a certain number of upvotes omits those links because Reddit admins did site-wide mass deletions of the relevant submissions, and that leaves the BPEs ‘orphaned’ with little relevant training material. (As the GPT-2 paper describes it in the section I linked, they downloaded Common Crawl, and then used the live set of Reddit links, presumably Pushshift despite the ‘scraped’ description, to look up entries in CC; so while deleted submissions’ fulltext would still be there in CC, it would be omitted if it had been deleted from Pushshift.)
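To make that pipeline concrete, here is a minimal sketch in Python, with invented data structures standing in for the Pushshift dump and the Common Crawl index; the actual WebText scraping code is not public, so every name below is an assumption, not the real implementation:

```python
# Hypothetical sketch of a WebText-style corpus build. `submissions` stands in
# for a Pushshift-like dump of Reddit submissions; `common_crawl` for a
# URL -> fulltext lookup. Both structures are invented for illustration.

def build_corpus(submissions, common_crawl, min_karma=3):
    """Collect fulltext of outbound links from surviving, upvoted submissions."""
    corpus = []
    for sub in submissions:
        # A submission deleted from the dump contributes nothing, so its linked
        # page is omitted even though its fulltext still sits in Common Crawl.
        if sub["deleted"] or sub["karma"] < min_karma:
            continue
        text = common_crawl.get(sub["url"])
        if text is not None:
            corpus.append(text)
    return corpus

submissions = [
    {"url": "https://example.com/ordinary", "karma": 12, "deleted": False},
    {"url": "https://example.com/spam",     "karma": 40, "deleted": True},
]
common_crawl = {
    "https://example.com/ordinary": "ordinary page text",
    "https://example.com/spam":     "harassment spam repeating a username",
}
print(build_corpus(submissions, common_crawl))  # ['ordinary page text']
```

The point of the sketch is just that deletion operates at the level of the link list, not the page text: CC never forgets, but the pipeline never asks.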
But there is still little evidence for it, and I still don’t see how it would work, exactly: there are plenty of websites that would refer to ‘gmaxwell’ (such as my own comments in various places like HN), and the only way to starve GPT of all knowledge of the username ‘gmaxwell’ (and thus, presumably the corresponding BPE token) would be to censor all such references—which would be quite tricky, and obviously did not happen if ChatGPT can recite your bio & name.
And the timeline is weird: it needs some sort of ‘intermediate’ dataset for the BPEs to train on, one which contains the forbidden harassment material and which will then be excluded from the ‘final’ training dataset when the list of URLs is compiled from the now-censored Pushshift list of positive-karma non-deleted URLs; but this intermediate dataset doesn’t seem to exist! There is no mention in the GPT-2 paper of running the BPE tokenizer on an intermediate dataset and reusing it on the later training dataset it describes, and I and everyone else had always assumed that the BPE tokenizer had been run on the final training dataset. (The paper doesn’t indicate otherwise; this is the logical workflow, since you want your BPE tokenizer to compress your actual training data & not some other dataset; the BPE section comes after the webscrape section in the paper, which implies it was done afterwards rather than before on a hidden dataset; and all of the garbage in the BPEs encoding spam or post-processed-HTML-artifacts looks like it was tokenized on the final training dataset rather than some sort of intermediate, less-processed dataset.) So if there were some large mass of harassment material using the names ‘gmaxwell’/‘PeterTodd’ which was deleted off Reddit, it does not seem like it should’ve mattered.
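(The ‘orphaning’ mechanism itself is easy to demonstrate if you grant the two-dataset premise. A toy sketch, pure-Python character-level BPE; the corpora and merge budget are invented for illustration:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges from an iterable of text lines (toy, character-level)."""
    words = Counter(tuple(w) for line in corpus for w in line.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent symbol pair
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

def segment(word, merges):
    """Segment one word by replaying the learned merges in order."""
    syms = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(syms):
            if i + 1 < len(syms) and syms[i] == a and syms[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(syms[i])
                i += 1
        syms = out
    return syms

# 'Intermediate' corpus: repetitive spam makes the name a single BPE token.
intermediate = ["gmaxwell gmaxwell gmaxwell is bad"] * 50 + ["the cat sat"] * 5
merges = train_bpe(intermediate, num_merges=20)
print(segment("gmaxwell", merges))  # -> ['gmaxwell']: one dedicated token

# 'Final' corpus with the spam filtered out: the merge rules (and hence the
# vocabulary slot) survive, but the token never occurs in training, so its
# embedding would never receive a meaningful gradient update: it is 'orphaned'.
final = ["the cat sat on the mat"] * 50
count = sum(segment(w, merges).count("gmaxwell")
            for line in final for w in line.split())
print(count)  # 0
```

But the demonstration only goes through if the tokenizer was trained on a corpus the model then never saw, which is exactly the intermediate dataset the paper gives no sign of.)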
I hope you would agree that it’s odd that the only two obviously human names in the list are people who know each other and have collaborated in the past.
I agree there is probably some sort of common cause which accounts for these two BPEs, and it’s different from the ‘counting’ cluster of Reddit names, but not that you’ve identified what it is.