Erik Søe Sørensen comments on SolidGoldMagikarp III: Glitch token archaeology

Erik Søe Sørensen 15 Mar 2023 21:49 UTC
13 points
The thing about “ÃÂ” appears to be this: take some (or at least certain) innocent character in the Latin-1-but-not-ASCII code range, say “æ”, and encode it in UTF-8. Then take the resulting bytes, interpret them as Latin-1, and convert them to UTF-8 again. Repeat that process a few times and you get:
$ echo 'æ' | iconv -f latin1 -t UTF-8 | iconv -f latin1 -t UTF-8 | iconv -f latin1 -t UTF-8 | iconv -f latin1 -t UTF-8
ÃÂÃÂÃÂÃÂ¦
Well, between those various “A”s there are actually some invisible C1 control characters: “NO BREAK HERE” (NBH, U+0083) and “BREAK PERMITTED HERE” (BPH, U+0082). The real structure is:
Ã<NBH>Â<NBH>Ã<BPH>Â<NBH>Ã<NBH>Â<BPH>Ã<BPH>Â¦
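The same pipeline can be reproduced in Python to make the invisible characters show up. This is only a sketch, not something from the original command: the helper name `misread_as_latin1` is made up for illustration, and it assumes the terminal handed `iconv` the UTF-8 bytes of “æ”, which is why the input is encoded once before the four rounds.

```python
def misread_as_latin1(data: bytes) -> bytes:
    """One `iconv -f latin1 -t UTF-8` step: read the bytes as Latin-1, write UTF-8."""
    return data.decode("latin-1").encode("utf-8")

# Assumption: `echo 'æ'` emits the UTF-8 bytes C3 A6 (i.e. a UTF-8 terminal).
data = "æ".encode("utf-8")
for _ in range(4):
    data = misread_as_latin1(data)
text = data.decode("utf-8")

# The two C1 controls that occur here, labelled by hand so they can be seen.
C1_LABELS = {"\x82": "<BPH>", "\x83": "<NBH>"}
revealed = "".join(C1_LABELS.get(ch, ch) for ch in text)

print(len(text))   # 16
print(revealed)    # Ã<NBH>Â<NBH>Ã<BPH>Â<NBH>Ã<NBH>Â<BPH>Ã<BPH>Â¦
```

Each “Ã” (U+00C3) re-encodes to the bytes C3 83, and each “Â” (U+00C2) to C3 82, which is where the NBH and BPH controls come from.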
Even if you start with a non-Latin-1 character, you may end up with these sequences.
The replacement character (U+FFFD), for instance, eventually becomes “ÃÂÃÂ¯ÃÂÃÂ¿ÃÂÃÂ½”.
Accidentally interpreting UTF-8 as Latin-1 (or vice versa) is a fairly easy programming mistake to make, so it’s not too surprising that it has happened often around the web. A web search for “ÃÂÃÂ” shows many occurrences within regular text, such as “...You donÃÂ¢ÃÂÃÂt review this small book; you tell people about it ÃÂ¢ÃÂÃÂ adults as well as kids ÃÂ¢ÃÂÃÂ and say...”
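Both non-Latin-1 examples above can be checked with a short sketch. Several things here are assumptions rather than facts from the thread: the helper names `mojibake` and `visible` are hypothetical, the replacement character is taken through four rounds, and the apostrophe in the quoted snippet is guessed to have been U+2019 (right single quotation mark) taken through three rounds. C1 controls are stripped to mimic what typically remains visible on a web page.

```python
def mojibake(text: str, rounds: int) -> str:
    """Encode as UTF-8, then misread the bytes as Latin-1, `rounds` times over."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("latin-1")
    return text

def visible(text: str) -> str:
    """Drop C1 control characters (U+0080..U+009F), which usually render as nothing."""
    return "".join(ch for ch in text if not 0x80 <= ord(ch) <= 0x9F)

replacement = visible(mojibake("\ufffd", 4))  # U+FFFD, four rounds (assumed count)
apostrophe = visible(mojibake("\u2019", 3))   # U+2019, three rounds (assumed origin)

print(replacement)  # ÃÂÃÂ¯ÃÂÃÂ¿ÃÂÃÂ½
print(apostrophe)   # ÃÂ¢ÃÂÃÂ
```

Under the same assumptions, an en dash (U+2013) yields the same visible string as the apostrophe, because the two differ only in a final byte that ends up as an invisible control character. That would fit the quote showing the same sequence where a dash belongs.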

Anyway, I think that demystifies that group of tokens – including why their lengths happen to be powers of two.
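The doubling is easy to check. Here is a sketch under the same model as the iconv pipeline (the function name `misread` is hypothetical): every character from U+0080 to U+00FF takes two bytes in UTF-8, so each misreading doubles the character count, and counting only the “Ã”/“Â” letters gives 1, 2, 4, 8, and so on.

```python
def misread(text: str) -> str:
    """Encode as UTF-8 and misinterpret the resulting bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")

text = "æ"
total_lengths = []   # characters per round, invisible controls included
letter_counts = []   # only the visible "Ã" / "Â" letters
for _ in range(6):
    text = misread(text)
    total_lengths.append(len(text))
    letter_counts.append(sum(ch in "ÃÂ" for ch in text))

print(total_lengths)  # [2, 4, 8, 16, 32, 64]
print(letter_counts)  # [1, 2, 4, 8, 16, 32]
```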
(I’m curious about whether the actual tokens contain the invisible <NBH>/<BPH> characters or not, though...)
Regards,
Erik
(Stumbled into this via Computerphile, by the way.)
What links here?
- Glitch Token Catalog - (Almost) a Full Clear by Lao Mein (21 Sep 2024 12:22 UTC; 38 points)
- A New Class of Glitch Tokens—BPE Subtoken Artifacts (BSA) by Lao Mein (20 Sep 2024 13:13 UTC; 37 points)
- mwatkins 15 Mar 2023 22:48 UTC
  1 point
  Thanks for this, Erik—very informative.