The thing about “Ô appears to be this: take some (or at least certain) innocent character in the Latin-1-but-not-ASCII code range, say “æ”, and encode it in UTF-8; then take the resulting bytes, interpret them as Latin-1, and encode the result as UTF-8 again. Repeat that process and you get a run of alternating “Ã” and “Â” characters that doubles in length with each round.
Well, between those various “A”s there are actually some invisible “NO BREAK HERE” and “BREAK PERMITTED HERE” characters. After four rounds of this, the real structure is
Ã<NBH>Â<NBH>Ã<BPH>Â<NBH>Ã<NBH>Â<BPH>Ã<BPH>Â¦
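For anyone who wants to reproduce this, here is a minimal sketch of the process described above. It assumes Python's built-in "latin-1" codec stands in for the misreading step; the names mojibake_round and show are just illustrative helpers, not anything from a library. It prints each round with the two invisible C1 control characters (U+0083 and U+0082) spelled out.

```python
def mojibake_round(text: str) -> str:
    """One round: encode as UTF-8, then misread the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


# Visible labels for the two invisible C1 control characters involved.
NAMES = {"\x82": "<BPH>", "\x83": "<NBH>"}


def show(text: str) -> str:
    """Replace the invisible control characters with readable labels."""
    return "".join(NAMES.get(ch, ch) for ch in text)


s = "æ"
for round_number in range(1, 5):
    s = mojibake_round(s)
    # Print the round number, the string length, and the labelled string.
    print(round_number, len(s), show(s))
```

Python's "latin-1" codec maps every byte value to the code point with the same number, which is why the control characters survive instead of raising a decoding error.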
Even if you start with a non-Latin-1 character, you may end up with these characters. The replacement character (U+FFFD), for instance, eventually becomes “ÃÂ¯ÃÂ¿ÃÂ½”.
Accidentally interpreting UTF-8 as Latin-1, or vice versa, is a fairly easy programming mistake to make, so it’s not too surprising that it has happened often around the web. A web search for “ÃÂÔ shows many occurrences within regular text, such as “...You donâÃÂÃÂt review this small book; you tell people about it âÃÂàadults as well as kids âÃÂàand say...”
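To illustrate how easily the mistake happens, here is a small sketch with an ordinary word. The word "café" is just an example I picked; the point is that text written as UTF-8 and read back with the wrong codec raises no error, it just comes out wrong.

```python
original = "café"

# The text is written out as UTF-8 ...
data = original.encode("utf-8")

# ... and read back with the wrong codec. No exception is raised.
misread = data.decode("latin-1")
print(misread)  # cafÃ©

# A single round can be undone by reversing the two steps,
# provided no bytes were altered or dropped in between.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```

If the misread text gets saved as UTF-8 and the same mistake is made again, you are on to the second round, and so on.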
Anyway, I think that demystifies that group of tokens, including why their lengths happen to be powers of two. (I’m curious about whether the actual tokens contain the invisible <NBH>/<BPH> characters or not, though...)
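On the powers of two: every code point from U+0080 to U+00FF takes exactly two bytes in UTF-8, and after the first round the string contains nothing but such code points, so each further round doubles the character count. A quick check of that reasoning, again assuming Python's "latin-1" codec for the misreading step (mojibake_round is an illustrative name):

```python
def mojibake_round(text: str) -> str:
    """One round: encode as UTF-8, then misread the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


# Every code point in U+0080..U+00FF encodes to exactly two UTF-8 bytes.
two_bytes_each = all(
    len(chr(cp).encode("utf-8")) == 2 for cp in range(0x80, 0x100)
)
print(two_bytes_each)  # True

s = "æ"
lengths = []
for _ in range(5):
    s = mojibake_round(s)
    lengths.append(len(s))

print(lengths)  # [2, 4, 8, 16, 32]
```

These counts include the invisible characters; whether the tokens themselves include them is the open question above.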
Regards, Erik
(Stumbled into this via Computerphile, by the way.)
Thanks for this, Erik—very informative.