‘ newcom’, ‘slaught’, ‘senal’ and ‘volunte’
I think these could be a result of a simple stemming algorithm:
newcomer → newcom
volunteer → volunte
senaling → senal
Stemming is commonly used to preprocess text and to build indexes in information retrieval. Perhaps some of this preprocessed text, or the indexes themselves, ended up in the training corpus?
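As a minimal sketch of the idea, here is NLTK's Porter stemmer applied to a few of these words (this assumes nltk is installed; it only illustrates the hypothesis, since we don't know what preprocessing pipeline, if any, actually produced these strings):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["newcomer", "volunteer", "slaughtering"]:
    # Porter strips "-er" when the remaining stem is long enough,
    # yielding exactly the truncations seen in the token list.
    print(word, "->", stemmer.stem(word))

# Expected output (per the Porter rules):
#   newcomer -> newcom
#   volunteer -> volunte
#   slaughtering -> slaughter
# Note: Porter leaves "slaughter" intact (the stem "slaught" is too short
# for the "-er" rule to fire), so producing 'slaught' would take a more
# aggressive stemmer, e.g. Lancaster, or a different rule set.
```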
It’s not that mysterious that they ended up as tokens. What’s puzzling is why they show up in so many completions when GPT-3 is prompted to repeat the “forbidden” token strings.