What you want right now is a plain-ASCII text file. No Unicode, no HTML, no nothing.
Thank you. I will try this and see if it helps with the paragraph double spacing problem.
Wow. My encoding options are limited to two Unicode variants, ANSI and UTF-8. Will any of those work for these purposes?
For future reference, ANSI is not Unicode. You can google up the gory details if interested, but basically ASCII is a seven-bit character set with 128 symbols. The so-called ANSI (it’s a misnomer) extends ASCII to 8 bits, adding another 128 symbols, but without specifying what these symbols should be. On most Anglophone computers these will correspond to ISO 8859-1 (or the very similar Windows codepage 1252), but in other parts of the world they will correspond to whatever the local codepage is, and that can be anything it wants to be.
UTF-8, on the other hand, is proper Unicode. So it seems the closest you can get to plain ASCII is to use ANSI.
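To make the “it can be anything it wants to be” point concrete, here is a minimal Python sketch. It is not from the thread: the byte value 0xE9 and the three code pages are arbitrary picks for illustration. It reads one and the same byte under three different “ANSI” code pages and prints the Unicode code point each one maps it to.

```python
# One byte from the upper half (128-255), which plain ASCII leaves undefined.
raw = bytes([0xE9])

# Illustrative choice of code pages: Western European, Cyrillic, Greek.
code_pages = ("cp1252", "cp1251", "cp1253")

# The same byte decoded under each "ANSI" code page.
decoded = {name: raw.decode(name) for name in code_pages}

for name, char in decoded.items():
    # Print the code point rather than the character itself, so the
    # output does not depend on the terminal's own encoding.
    print(f"{name}: U+{ord(char):04X}")

# The lower half is plain ASCII and decodes identically everywhere.
ascii_bytes = b"plain text"
same_everywhere = all(ascii_bytes.decode(name) == "plain text" for name in code_pages)
print("ASCII half identical:", same_everywhere)
```

The three results are different letters (Latin, Cyrillic and Greek), which is why an “ANSI” file is only readable if you know, or can guess, which code page wrote it.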
So, if I understand the implication, anything encoded in ANSI is not universally machine readable (there are several unfamiliar terms for me here: “anglophone”, “ISO 8859-1” and “Windows codepage 1252”)? I probably won’t look up all the details, because I rarely need to know how many bits a method of encryption involves (I’m probably betraying my naivety here) irregardless of the character set used, but I appreciate how solid of a handle you seem to have on the subject.
anglophone
English-speaking.
ISO 8859-1
An 8-bit character set (i.e., representing 256 different characters) suitable for many Western European languages.
Windows codepage 1252
Something very much like ISO-8859-1 but slightly different, used on computers running Microsoft Windows. It’s slightly different because for some reason (there are more and less cynical explanations) Microsoft seem unable to use anything standardized without modifying it a little.
ANSI
Microsoft-Windows-ese for “an 8-bit character set whose first half is the same as ASCII”. Specifying the second half is the job of a “code page”, such as the “code page 1252” mentioned above.
not universally machine readable
Not machine-readable without knowledge of which “code page” (see above) it uses. If you know that, or can guess it, you’re OK.
encryption
Not actually encryption, despite the term “encoding”. A character encoding is a way of representing characters as smallish numbers suitable for storing in a computer. Strictly speaking, every time I said “character set” above I should have said “encoding”. Every time you have any text on a computer, it’s represented internally via some encoding. Common encodings include ASCII (7 bits so 128 characters, but actually some of those 128 slots are reserved for things that aren’t really characters), ISO-8859-1 (8 bits, suitable for much Western European text, though actually nowadays the slightly different ISO-8859-15 is preferred because it includes the Euro currency symbol), and UTF-8 (variable length, from 8 to 32 bits per character, represents the whole, very large, Unicode character repertoire). For most purposes UTF-8 is a good bet.
irregardless
Regardless. (Sorry.)
[EDITED to answer the question about “not universally machine readable”.]
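As a rough illustration of the encodings listed above, the following Python sketch (an editorial addition, with sample characters picked only because they cover each case) counts how many bytes UTF-8 needs for a few characters, and then checks how the Euro sign fares in the 8-bit encodings mentioned. The helper `try_encode` is a hypothetical name introduced for this example.

```python
# Hypothetical sample characters, chosen to cover each UTF-8 length.
samples = {
    "LATIN CAPITAL A": "A",
    "E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "GRINNING FACE": "\U0001F600",
}

# UTF-8 is variable length: count the bytes each character needs.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
for name, n in utf8_lengths.items():
    print(f"{name}: {n} byte(s), {8 * n} bits")


def try_encode(ch, encoding):
    """Return the encoded bytes, or None if the encoding lacks the character."""
    try:
        return ch.encode(encoding)
    except UnicodeEncodeError:
        return None


# The Euro sign is missing from ASCII and ISO-8859-1, but present
# (at different byte values) in ISO-8859-15 and Windows code page 1252.
euro = samples["EURO SIGN"]
encodings = ("ascii", "iso-8859-1", "iso-8859-15", "cp1252")
euro_bytes = {enc: try_encode(euro, enc) for enc in encodings}
for enc, data in euro_bytes.items():
    print(enc, data)
```

Note that ISO-8859-15 and code page 1252 both contain the Euro sign but store it as different bytes, which is one more instance of the “which code page?” problem.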
It has nothing to do with my article but you’ve made me very happy by explaining this to me. I think I understand better what is meant by “encoding”. Also the bit about regardless I found quite witty and even laughed out loud (xkcd.com kept me informed about the OED’s decision on that word).
So the encoding was probably not the problem then, because most programs default to ANSI and it was not the unanimous first suggestion from everyone to switch to 7-bit encoding … although I do understand why ASCII is more universal now. Open questions in my mind now include: does the GUI read ASCII and ANSI? And what encoding is used for copying and pasting text?
The main problem was most likely that your text was full of nonbreaking spaces. A conversion to actual ASCII would have got rid of those because the (rather limited) ASCII character repertoire doesn’t include nonbreaking spaces. I doubt that using an “ANSI” character set did that, though, so yes, the encoding was probably a red herring.
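The nonbreaking-space point can be checked with a short Python sketch. This is an illustration under assumptions, not a reconstruction of the actual article: the sample string is made up, and how a given editor handles the conversion will vary. It shows that a nonbreaking space (U+00A0) cannot be encoded as ASCII, that it survives unchanged in code page 1252 as byte 0xA0, and that swapping it for an ordinary space gives text that is plain ASCII.

```python
# Made-up sample text with nonbreaking spaces (U+00A0) instead of normal ones.
text = "First\u00a0paragraph.\u00a0Second\u00a0sentence."

# Strict ASCII encoding fails, because ASCII has no nonbreaking space.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("encodes as ASCII:", ascii_ok)

# An "ANSI" code page such as cp1252 keeps them, as byte 0xA0.
ansi_bytes = text.encode("cp1252")
nbsp_count = ansi_bytes.count(b"\xa0")
print("nonbreaking spaces kept in cp1252:", nbsp_count)

# Replacing them with ordinary spaces yields text that is plain ASCII.
cleaned = text.replace("\u00a0", " ")
cleaned_bytes = cleaned.encode("ascii")
print(cleaned_bytes)
```

Python's strict encoder refuses the text outright; other tools may silently substitute a space or a question mark instead, so the result of “convert to ASCII” depends on the program doing it.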
does the GUI read ASCII and ANSI?
What GUI?
what encoding is used for copy and pasting text?
That would be an implementation detail of your operating system; if it’s competently implemented (which I think pretty much everything is these days) you should think of what’s copied and pasted as being made up of characters, not of the numbers used to encode them.
However, at least on some systems, if you copy from one application that supports (not just plain text but) formatted text into another, the formatting will be (at least roughly) preserved. This will happen, e.g., if you copy and paste from a web browser into Microsoft Word. I find that this is scarcely ever what I want. There’s usually a way to paste in just the text (sometimes categorized as “Paste Special”, which may offer other less-common options for pasting stuff too).
cool :-)
Either way, I owe you.
ANSI works if I turn off word wrap and put the space between paragraphs, as you suggested. Thanks again Lumifer.