Hello. I’m apparently one of the GPT3 basilisks. Quite odd to me that two of the only three (?) recognizable human names in that list are myself and Peter Todd, who is a friend of mine.
If I had to take a WAG at the behavior described here: both Petertodd and I have been the target of a considerable amount of harassment/defamation/schizo comments on reddit due to commercially funded attacks connected to our past work on Bitcoin. It may be possible that comments targeting us were included in an early phase of GPTn design (e.g. in the tokenizer), but someone noticed an in-development model spontaneously defaming us and then expressly filtered out material mentioning us from the training. Without any input, our tokens would be free to fall to the center of the embedding, where they’re vulnerable to numerical instabilities (leading, e.g., to instability even at temp 0).
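To make that concrete, here’s a minimal PyTorch sketch of the mechanism I have in mind (purely illustrative; it has nothing to do with OpenAI’s actual training code): an embedding row for a token that never appears in the training data receives no gradient and just sits wherever it was initialized, while the rows for tokens that do appear get pushed around.

```python
# Toy illustration of the "orphaned token" hypothesis (not OpenAI's code):
# a token id that never occurs in the training data gets no gradient, so its
# embedding row never moves from its random initialization.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, dim = 10, 8
emb = nn.Embedding(vocab_size, dim)
init = emb.weight.detach().clone()

# Hypothetical training data that never contains token id 9 (the "orphan").
data = torch.randint(0, 9, (200, 4))
opt = torch.optim.SGD(emb.parameters(), lr=0.1)
for batch in data:
    opt.zero_grad()
    loss = emb(batch).pow(2).mean()   # stand-in objective; any loss will do
    loss.backward()
    opt.step()

moved = (emb.weight.detach() - init).norm(dim=1)
print(moved)  # row 9 has moved ~0; every other row has drifted
```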
AFAIK I’ve never complained about GPTx’s output concerning me (and I doubt petertodd has either), but if the model was spontaneously emitting crap about us at some point of development I could see it getting filtered. It might not have involved targeting us specifically; it could have just been a product of improving the filtering (including filtering by Reddit, if the data was collected at multiple times; I believe much of the worst content has been removed by reddit) so that the most common sources of farm-generated attack content were no longer in the training data.
It’s worth noting that GPT3 is perfectly able to talk about me if you ask about “Greg Maxwell” and knows who I am, so I doubt any changes were about me specifically but more likely about specific bad content.
That seems highly unlikely. You can look at the GPT-1 and GPT-2 papers and see how haphazard the data-scraping & vocabulary choice were; they were far down the list of priorities (compare eg. the development of The Pile). The GPT models just weren’t a big deal, and were just Radford playing around with GPUs to see what a big Transformer could do (following up earlier RNNs), and then Amodei et al scaling that up to see if it’d help their preference RL work. The GPTs were never supposed to be perfect, but as so often in computing, what was regarded as a disposable prototype turned out to have unexpected legs… They do not mention any such filtering, nor is it obvious that they would have bothered considering that GPT-2 was initially not going to be released at all, nor have I heard of any such special-purpose tailoring before (the censorship really only sets in with DALL-E 2); nor have I seen, in the large quantities of GPT-2 & GPT-3 output I have read, much in the way of spontaneous defamation of other people. Plus, if they had carefully filtered out you/Todd because of some Reddit drama, why does ChatGPT do perfectly fine when asked who you and Todd are (as opposed to the bad tokens)? The first prompt I tried:
Who are Peter Todd and Gregory Maxwell?
Peter Todd and Gregory Maxwell are well-known figures in the world of cryptocurrency and blockchain technology.
Peter Todd is a computer science researcher and blockchain developer who has made significant contributions to the development of the Bitcoin protocol. He is known for his expertise in cryptographic systems, consensus algorithms, and security.
Gregory Maxwell is also a prominent figure in the cryptocurrency community, and is known for his contributions to the development of the Bitcoin protocol and for his involvement in the open-source Bitcoin development community. He is a strong advocate for privacy and security in cryptocurrency, and has been involved in various discussions and debates around these topics.
Those capsule bios aren’t what I’d expect if you two had been very heavily censored out of the training data. I don’t see any need to invoke special filtering here, given the existence of all the other bizarre BPEs which couldn’t’ve been caused by any hypothetical filtering.
I think I addressed that specifically in my comment above. The behavior is explained by a sequence like: there is a large amount of bot-spammed harassment material that goes into early GPT development; someone removes it, either from reddit or just from the training data, not on the basis of it mentioning the targets but based on other characteristics (like being repetitive). Then the tokens are orphaned.
Many of the other strings in the list of triggers look like they may have been UI elements or other markup removed by improved data sanitation.
I know that reddit has removed a very significant number of comments referencing me, since they’re gone when I look them up. I hope you would agree that it’s odd that the only two obviously human names in the list are people who know each other and have collaborated in the past.
There is a large amount of bot-spammed harassment material that goes into early GPT development; someone removes it, either from reddit or just from the training data, not on the basis of it mentioning the targets but based on other characteristics (like being repetitive). Then the tokens are orphaned.
That’s a different narrative from what you were first describing:
someone noticed an in-development model spontaneously defaming us and then expressly filtered out material mentioning us from the training.
Your first narrative is unlikely for all the reasons I described: that an OAer bestirred themselves to special-case you & Todd, and only you and Todd, for an obscure throwaway research project en route to bigger & better things, to block behavior which manifests nowhere else but only hypothetically in the early outputs of a model whose outputs they by & large weren’t reading to begin with, and whose data they weren’t doing much cleaning of.
Now, a second narrative, in which the initial tokenization has those tokens, and then the later webscrape they describe doing on the basis of Reddit external (submitted/outbound) links with a certain number of upvotes omits all the relevant links because Reddit admins did site-wide mass deletions of them, leaving the BPEs ‘orphaned’ with little relevant training material, is more plausible. (As the GPT-2 paper describes it in the section I linked, they downloaded Common Crawl, and then used the live set of Reddit links, presumably Pushshift despite the ‘scraped’ description, to look up entries in CC, so while deleted submissions’ fulltext would still be there in CC, it would be omitted if it had been deleted from Pushshift.)
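To spell out that lookup logic as I read the paper’s description, a toy sketch (function and variable names are hypothetical, not anything from OpenAI’s code): the scrape keeps only the Common Crawl documents whose URL still appears in the Reddit link dump, so a link deleted from that dump drops out of the corpus even though its fulltext is still sitting in CC.

```python
# Toy sketch of the WebText-style lookup as I understand the paper's description
# (names here are hypothetical, not OpenAI's code).
def build_corpus(common_crawl_docs, reddit_links, min_karma=3):
    """common_crawl_docs: dict url -> text; reddit_links: dict url -> karma."""
    kept_urls = {url for url, karma in reddit_links.items() if karma >= min_karma}
    # Anything deleted from the Reddit link dump is simply never looked up,
    # even though its text still exists in the Common Crawl snapshot.
    return {url: text for url, text in common_crawl_docs.items() if url in kept_urls}
```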
But there is still little evidence for it, and I still don’t see how it would work, exactly: there are plenty of websites that would refer to ‘gmaxwell’ (such as my own comments in various places like HN), and the only way to starve GPT of all knowledge of the username ‘gmaxwell’ (and thus, presumably the corresponding BPE token) would be to censor all such references—which would be quite tricky, and obviously did not happen if ChatGPT can recite your bio & name.
And the timeline is weird: it needs some sort of ‘intermediate’ dataset for the BPEs to train on which has the forbidden harassment material which will then be excluded from the ‘final’ training dataset when the list of URLs is compiled from the now-censored Pushshift list of positive-karma non-deleted URLs, but this intermediate dataset doesn’t seem to exist! There is no mention in the GPT-2 paper of running the BPE tokenizer on an intermediate dataset and reusing it on the later training dataset it describes, and I and everyone else had always assumed that the BPE tokenizer had been run on the final training dataset. (The paper doesn’t indicate otherwise, this is the logical workflow since you want your BPE tokenizer to compress your actual training data & not some other dataset, the BPE section comes after the webscrape section in the paper which implies it was done afterwards rather than before on a hidden dataset, and all of the garbage in the BPEs encoding spam or post-processed-HTML-artifacts looks like it was tokenized on the final training dataset rather than some sort of intermediate less-processed dataset.) So if there were some large mass of harassment material using the names ‘gmaxwell’/‘PeterTodd’ which was deleted off Reddit, it does not seem like it should’ve mattered.
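For concreteness about why the choice of corpus matters here, a toy re-implementation of BPE merge-learning in the spirit of Sennrich et al. (not GPT-2’s actual tokenizer code): the merges, and hence the vocabulary, are derived purely from whatever corpus the tokenizer is trained on, so a string like ‘gmaxwell’ only ends up as a single token if it is frequent in that particular corpus.

```python
# Minimal BPE merge-learning sketch (illustrative only, not GPT-2's tokenizer).
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """word_counts: dict of word -> frequency; returns the learned merge list."""
    vocab = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

# A hypothetical corpus where "gmaxwell" is very frequent (e.g. spam) quickly
# builds merges that assemble the whole name into one symbol...
print(learn_bpe({"gmaxwell": 500, "maxwell": 20, "hello": 100}, 10))
# ...while the same procedure on a corpus where the name is absent never does.
print(learn_bpe({"maxwell": 20, "hello": 100, "world": 100}, 10))
```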
I hope you would agree that it’s odd that the only two obviously human names in the list are people who know each other and have collaborated in the past.
I agree there is probably some sort of common cause which accounts for these two BPEs, and it’s different from the ‘counting’ cluster of Reddit names, but not that you’ve identified what it is.
The idea that tokens found closest to the centroid are those that have moved the least from their initialisations during their training (because whatever it was that caused them to be tokens was curated out of their training corpus) was originally suggested to us by Stuart Armstrong. He suggested we might be seeing something analogous to “divide-by-zero” errors with these glitches.
However, we’ve ruled that out.
Although there’s a big cluster of them in the list of closest-tokens-to-centroid, they appear at all distances. And there are some extremely common tokens like “advertisement” at the same kind of distance. Also, in the gpt2-xl model, there’s a tendency for them to be found as far as possible from the centroid, as you can see in these histograms:
They show the distribution of distances-from-centroid across token sets in the three models we studied: the upper histograms represent only the 133 anomalous tokens, compared to the full set of 50,257 tokens in the lower histograms. To give a sense of scale, the spikes above can just be seen as little bumps below.
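For anyone who wants to check these distances themselves, here’s a rough sketch of the computation for gpt2-small (assuming the Hugging Face transformers library; this is a reconstruction rather than our exact analysis code, so exact numbers may differ slightly):

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")    # gpt2-small
model = GPT2Model.from_pretrained("gpt2")

emb = model.get_input_embeddings().weight.detach()   # shape (50257, 768)
centroid = emb.mean(dim=0)
dists = (emb - centroid).norm(dim=1)                  # Euclidean distance per token

for s in [" gmaxwell", " petertodd"]:
    ids = tokenizer.encode(s)
    if len(ids) == 1:                                 # only meaningful if s is a single token
        d = dists[ids[0]].item()
        rank = int((dists < d).sum())                 # 0 = closest token to the centroid
        print(f"{s!r}: distance {d:.4f}, rank {rank} out of {len(dists)}")

# Histograms of `dists` (for all tokens vs. just the anomalous ones) give
# plots like the ones described above.
```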
The ‘ gmaxwell’ token is at very close to median distance from centroid in the gpt2-small model: its distance is 3.2602, where the range is 1.5366 to 4.826. It’s only moderately closer to the centroid in the gpt2-xl and gpt2-j models. The ‘ petertodd’ token is closer to the centroid in gpt2-j (no. 74 in the closest-tokens list), but pretty average-distanced in the other two models.
Could the fact that ‘ petertodd’ is one of the closest tokens to the embedding centroid for at least one model, while ‘ gmaxwell’ isn’t, tell us something about why ‘ petertodd’ produces such intensely weird outputs and ‘ gmaxwell’ glitches in a much less remarkable way?
We can’t know yet, because ultimately this positional information in GPT-2 and -J embedding spaces tells us nothing about why ‘ gmaxwell’ glitches out GPT-3 models. We don’t have access to the GPT-3 embeddings data. Only someone at OpenAI with access to that could clarify the extent to which the glitchiness of glitch tokens (a more variable phenomenon than we originally thought) correlates with distance-from-centroid in the embedding space of the model that they’re glitching.