Lumifer comments on The Limits of My Rationality

Lumifer 9 Dec 2014 22:00 UTC
0 points
For future reference, ANSI is not Unicode. You can google up the gory details if interested, but basically ASCII is a seven-bit character set with 128 symbols. The so-called ANSI (it’s a misnomer) extends ASCII to 8 bits and so another 128 symbols, but without specifying what these symbols should be. On most Anglophone computers these will correspond to ISO 8859-1 (or a very similar Windows codepage 1252), but in other parts of the world they will correspond to whatever the local codepage is and it can be anything it wants to be.

UTF-8, on the other hand, is proper Unicode. So it seems the closest you can get to plain ASCII is to use ANSI.
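
To make the code-page point concrete, here is a minimal Python sketch. The byte string is a made-up sample, and the codec names are the ones Python's standard library uses, not anything specific to a particular editor. It decodes the same four bytes under several "ANSI" code pages and shows that only the byte above 127 changes meaning.

```python
# Illustrative sample: the ASCII bytes for "caf" followed by the single byte 0xE9.
raw = bytes([0x63, 0x61, 0x66, 0xE9])

# Decode the same bytes under several 8-bit code pages.
# cp1252 = Western European Windows, latin-1 = ISO 8859-1,
# cp1251 = Cyrillic Windows, cp1253 = Greek Windows.
decoded = {name: raw.decode(name) for name in ("cp1252", "latin-1", "cp1251", "cp1253")}
for name, text in decoded.items():
    print(f"{name:8} -> {text}")

# Plain ASCII only defines bytes 0..127, so 0xE9 is not valid ASCII at all.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print("valid ASCII:", ascii_ok)
```

The first three characters come out the same every time because every one of these code pages agrees with ASCII on the lower 128 values. The last character depends entirely on which code page the reader assumes.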
- JoshuaMyer 9 Dec 2014 22:18 UTC
  0 points
  Parent
  So, if I understand the implication, anything encoded in ANSI is not universally machine readable? (There are several unfamiliar terms for me here: “anglophone”, “ISO 8859-1”, and “Windows codepage 1252”.) I probably won’t look up all the details, because I rarely need to know how many bits a method of encryption involves (I’m probably betraying my naivety here) irregardless of the character set used, but I appreciate how solid a handle you seem to have on the subject.
  - gjm 9 Dec 2014 22:54 UTC
    1 point
    Parent
    
    anglophone
    
    English-speaking.
    
    ISO-8859-1
    
    An 8-bit character set (i.e., representing 256 different characters) suitable for many Western European languages.
    
    Windows codepage 1252
    
    Something very much like ISO-8859-1 but slightly different, used on computers running Microsoft Windows. It’s slightly different because for some reason (there are more and less cynical explanations) Microsoft seem unable to use anything standardized without modifying it a little.
    
    ANSI
    
    Microsoft-Windows-ese for “an 8-bit character set whose first half is the same as ASCII”. Specifying the second half is the job of a “code page”, such as the “code page 1252” mentioned above.
    
    not universally machine readable
    
    Not machine-readable without knowledge of which “code page” (see above) it uses. If you know that, or can guess it, you’re OK.
    
    encryption
    
    Not actually encryption, despite the term “encoding”. A character encoding is a way of representing characters as smallish numbers suitable for storing in a computer. Strictly speaking, every time I said “character set” above I should have said “encoding”. Every time you have any text on a computer, it’s represented internally via some encoding. Common encodings include ASCII (7 bits so 128 characters, but actually some of those 128 slots are reserved for things that aren’t really characters), ISO-8859-1 (8 bits, suitable for much Western European text, though actually nowadays the slightly different ISO-8859-15 is preferred because it includes the Euro currency symbol), UTF-8 (variable length, from 8 to 32 bits per character, represents the whole—very large—Unicode character repertoire). For most purposes UTF-8 is a good bet.
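    
    As a rough illustration of "characters as smallish numbers", here is a short Python sketch. The three sample characters and the helper name encoded_length are my own choices for the example; the codec names are Python's. It reports how many bytes each character takes in each encoding, or None where the encoding has no slot for it.
    
    ```python
    # Sample characters: Latin capital A, e with acute accent, euro sign.
    samples = ["A", "\u00e9", "\u20ac"]
    
    def encoded_length(ch, encoding):
        """Return how many bytes ch occupies in encoding, or None if it has no representation."""
        try:
            return len(ch.encode(encoding))
        except UnicodeEncodeError:
            return None
    
    # Build a small table: character -> {encoding: byte count or None}.
    table = {
        ch: {enc: encoded_length(ch, enc) for enc in ("ascii", "iso-8859-1", "iso-8859-15", "utf-8")}
        for ch in samples
    }
    for ch, row in table.items():
        print(f"U+{ord(ch):04X}", row)
    ```
    
    The euro sign is the interesting row: ASCII and ISO-8859-1 cannot represent it, ISO-8859-15 fits it into one byte, and UTF-8 spends three bytes on it.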
    
    irregardless
    
    Regardless. (Sorry.)
    
    [EDITED to answer the question about “not universally machine readable”.]
    - JoshuaMyer 9 Dec 2014 23:15 UTC
      0 points
      Parent
      It has nothing to do with my article but you’ve made me very happy by explaining this to me. I think I understand better what is meant by “encoding”. Also the bit about regardless I found quite witty and even laughed out loud (xkcd.com kept me informed about the OED’s decision on that word).
      
      So the encoding was probably not the problem then because most programs default to ANSI and it was not the unanimous first suggestion from everyone to switch to 7-bit encoding … although I do understand why ASCII is more universal now. Open questions in my mind now include: does the GUI read ASCII and ANSI? And what encoding is used for copy and pasting text?
      - gjm 10 Dec 2014 1:12 UTC
        0 points
        Parent
        
        So the encoding was probably not the problem
        
        The main problem was most likely that your text was full of nonbreaking spaces. A conversion to actual ASCII would have got rid of those because the (rather limited) ASCII character repertoire doesn’t include nonbreaking spaces. I doubt that using an “ANSI” character set did that, though, so yes, the encoding was probably a red herring.
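        
        A minimal Python sketch of that point, under assumptions: the sample string is hypothetical, and I don't know what the actual editor or converter did with characters it couldn't represent. In Python, a strict conversion to ASCII refuses the nonbreaking space outright, whereas the Windows-1252 "ANSI" code page stores it happily as the byte 0xA0.
        
        ```python
        NBSP = "\u00a0"  # NO-BREAK SPACE, which has no ASCII equivalent
        text = f"plain{NBSP}text"  # hypothetical sample containing one nonbreaking space
        
        # Strict ASCII encoding fails because U+00A0 is outside the 0..127 range.
        try:
            text.encode("ascii")
            fits_in_ascii = True
        except UnicodeEncodeError:
            fits_in_ascii = False
        
        # An "ANSI" code page and UTF-8 both keep the nonbreaking space.
        as_cp1252 = text.encode("cp1252")  # NBSP becomes the single byte 0xA0
        as_utf8 = text.encode("utf-8")     # NBSP becomes the two bytes 0xC2 0xA0
        
        # One way to get clean ASCII: swap each NBSP for an ordinary space first.
        cleaned = text.replace(NBSP, " ")
        cleaned_bytes = cleaned.encode("ascii")
        
        print(fits_in_ascii, as_cp1252, as_utf8, cleaned_bytes)
        ```
        
        So saving as "ANSI" leaves the nonbreaking spaces in place, and they only disappear if something replaces or drops them on the way to ASCII.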
        
        does the GUI read ASCII and ANSI?
        
        What GUI?
        
        what encoding is used for copy and pasting text?
        
        That would be an implementation detail of your operating system; if it’s competently implemented (which I think pretty much everything is these days) you should think of what’s copied and pasted as being made up of characters, not of the numbers used to encode them.
        
        However, at least on some systems, if you copy from one application that supports (not just plain text but) formatted text into another, the formatting will be (at least roughly) preserved. This will happen, e.g., if you copy and paste from a web browser into Microsoft Word. I find that this is scarcely ever what I want. There’s usually a way to paste in just the text (sometimes categorized as “Paste Special”, which may offer other less-common options for pasting stuff too).
        - JoshuaMyer 10 Dec 2014 14:37 UTC
          0 points
          Parent
          cool :-)
  - JoshuaMyer 9 Dec 2014 22:18 UTC
    0 points
    Parent
    Either way, I owe you.