This is from OpenWebText, a recreation of GPT2 training data.
“@#&” [token 48193] occurred in 25 out of 20610 chunks. 24 of these were profanity censors (“Everyone thinks they’re so f@#&ing cool and serious”) and contained only a single instance each, while the other was the above text (where it occurs 3299 times!), which was probably used to build the tokenizer but removed from the training data.
I still don’t know what the hell it is. I’ll post the full text if anyone is interested.
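For anyone who wants to reproduce the count, a sketch like this should do it. The chunk directory and file layout here are assumptions (point it at however you have OpenWebText stored), and I'm using tiktoken's gpt2 encoding as the tokenizer:

```python
# Sketch of the counting described above. The directory layout
# ("openwebtext_chunks/*.txt") is made up -- adjust it to wherever
# your copy of the dataset lives. tiktoken's "gpt2" encoding matches
# GPT-2's BPE vocabulary.
import tiktoken
from pathlib import Path

enc = tiktoken.get_encoding("gpt2")
TOKEN_ID = 48193
print(repr(enc.decode([TOKEN_ID])))  # should print '@#&'

hits = []  # (chunk name, occurrence count) for chunks containing the token
for path in Path("openwebtext_chunks").glob("*.txt"):
    # disallowed_special=() so stray "<|endoftext|>" strings in the raw
    # text don't raise an error during encoding
    ids = enc.encode(path.read_text(encoding="utf-8"), disallowed_special=())
    n = ids.count(TOKEN_ID)
    if n:
        hits.append((path.name, n))

print(f"{len(hits)} chunks contain token {TOKEN_ID}")
for name, n in sorted(hits, key=lambda t: -t[1]):
    print(name, n)
```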
Is this an LLM generation or part of the training data?
Does it not have any sort of metadata telling you where it comes from?
My only guess is that some of it is probably metal lyrics.