Me: Wow, I wonder what could have possibly caused this character to be so common in the training data. Maybe it’s some sort of code, scraper bug or...
Some asshole in 2006:
Ave Maria : Alessandro Moreschi : Free Download, Borrow, and Streaming : Internet Archive
Carnival of Souls : Free Download, Borrow, and Streaming : Internet Archive
There are 200,000 instances of “ÃÂ” on 3 pages of archive.org alone, which would explain why there were so many GPT2 glitch tokens that were just blocks of “ÃÂ”.
“ÃÂ” is the result of a quine under common data handling errors: when each character is followed by a C1-block control character (in the 80–9F range), UTF-8 encoding followed by Latin-1 decoding expands it into a similarly modified “ÃÂÃÂ”. “ÃÂ” by itself is a proquine of that sequence. Many other characters nearby include the same key C3 byte in their UTF-8 encodings and thus fall into this attractor under repeated mismatched encode/decode operations; for instance, “é” becomes “Ã©” after one round of corruption, “ÃÂ©” after two rounds, and “ÃÂÃÂ©” after three.
(Edited for accuracy; I hadn’t properly described the role of the interleaved control characters the first time.)
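In case it helps to see the mechanism concretely, here is a minimal Python sketch of the round trip described above. The corrupt() and visible() helpers are my own names, not anything from the thread; visible() just drops the interleaved C1 controls so the printable residue is easy to read.

    # One round of the mismatch: encode as UTF-8, then (wrongly) decode the bytes as Latin-1.
    def corrupt(s: str) -> str:
        return s.encode("utf-8").decode("latin-1")

    # Strip the C1 controls (U+0080-U+009F) so only the printable residue remains.
    def visible(s: str) -> str:
        return "".join(ch for ch in s if not ("\x80" <= ch <= "\x9f"))

    # The quine: "ÃÂ" with a C1 control after each character.
    quine = "\u00c3\u0083\u00c2\u0082"
    print(visible(corrupt(quine)))   # ÃÂÃÂ -- two similarly modified copies

    # A nearby character falling into the attractor:
    s = "é"
    for n in range(1, 4):
        s = corrupt(s)
        print(n, visible(s))         # 1 Ã©   2 ÃÂ©   3 ÃÂÃÂ©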
Any idea how it could happen to the point of 10,000s of consecutive characters? Below is a less extreme example from archive.org where it replaced some punctuation with 16 of them.
Grateful Dead Live at Manor Downs on 1982-07-31 : Free Borrow & Streaming : Internet Archive
Well, the expansion is exponential, so it doesn’t take that many rounds of bad conversions to get very long strings of this. Any kind of editing or transport process that might be applied multiple times and has mismatched input and output encodings could be the cause; I vaguely remember multiple rounds of “edit this thing I just posted” doing something similar in the 1990s when encoding problems were more the norm, but I don’t know what the Internet Archive or users’ browsers might have been doing in these particular cases.
Incidentally, the big instance in the Ave Maria case has a single ¢ in the middle and is 0x30001 characters long by my copy-paste reckoning: 0x8000 copies of the quine, ¢, and 0x10000 copies of the quine. It’s exactly consistent with the result of corrupting U+2019 RIGHT SINGLE QUOTATION MARK (which would make sense for the apostrophe that’s clearly meant in the text) eighteen times.
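For what it’s worth, that reckoning checks out in a quick sketch (same assumed encode/decode mismatch as above, and assuming copy-paste drops the interleaved C1 controls):

    # Corrupt U+2019 eighteen times, then look at what's left once the C1 controls are stripped.
    s = "\u2019"
    for _ in range(18):
        s = s.encode("utf-8").decode("latin-1")

    printable = "".join(ch for ch in s if not ("\x80" <= ch <= "\x9f"))
    left, cent, right = printable.partition("\u00a2")   # split on the lone ¢

    print(hex(len(printable)))        # 0x30001 printable characters
    print(printable.count("\u00a2"))  # 1 -- a single ¢
    print(hex(len(left) // 2))        # 0x8000 copies of "ÃÂ" before it
    print(hex(len(right) // 2))       # 0x10000 copies of "ÃÂ" after it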
Thanks!
For cross-reference purposes for you and/or future readers, it looks like Erik Søe Sørensen made a similar comment (which I hadn’t previously seen) on the post “SolidGoldMagikarp III: Glitch token archaeology” a few years ago.