This is from OpenWebText, a recreation of GPT2 training data.
“@#&” [token 48193] occurred in 25 out of 20610 chunks. 24 of these were profanity censors (“Everyone thinks they’re so f@#&ing cool and serious”) and contained only a single instance each, while the other was the above text (where it occurs 3299 times!), which was probably used to build the tokenizer but removed from the training data.
I still don’t know what the hell it is. I’ll post the full text if anyone is interested.
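For anyone who wants to reproduce the count, a sketch like this should do it. The chunk directory and file layout here are assumptions (point it at however you have OpenWebText stored), and I'm using tiktoken's gpt2 encoding as the tokenizer:

```python
# Sketch of the counting described above. The directory layout
# ("openwebtext_chunks/*.txt") is made up -- adjust it to wherever
# your copy of the dataset lives. tiktoken's "gpt2" encoding matches
# GPT-2's BPE vocabulary.
import tiktoken
from pathlib import Path

enc = tiktoken.get_encoding("gpt2")
TOKEN_ID = 48193
print(repr(enc.decode([TOKEN_ID])))  # should print '@#&'

hits = []  # (chunk name, occurrence count) for chunks containing the token
for path in Path("openwebtext_chunks").glob("*.txt"):
    # disallowed_special=() so stray "<|endoftext|>" strings in the raw
    # text don't raise an error during encoding
    ids = enc.encode(path.read_text(encoding="utf-8"), disallowed_special=())
    n = ids.count(TOKEN_ID)
    if n:
        hits.append((path.name, n))

print(f"{len(hits)} chunks contain token {TOKEN_ID}")
for name, n in sorted(hits, key=lambda t: -t[1]):
    print(name, n)
```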
Is this an LLM generation or part of the training data?
Does it not have any sort of metadata telling you where it comes from?
My only guess is that some of it is probably metal lyrics.