Information theory challenge: A few posters have mentioned here that the average entropy of a character in English is about one bit. This carries an interesting implication: you should be able to create an interface using only two of the keyboard’s keys, such that composing an English message requires just as many keystrokes, on average, as it takes on a regular keyboard.
To do so, you’d have to exploit all the regularities of English to offer suggestions that save the user from having to specify individual letters. Most of the entropy is in the initial characters of a word or message, so you would probably spend more strokes specifying those, but then make it up with some “autocomplete” feature for large portions of the message.
If that’s too hard, it should be a lot easier to build a three-key input method, which only requires your message set to have an entropy below log2(3) ≈ 1.58 bits per character.
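A quick check of the arithmetic behind this, taking the ~1 bit per character figure at face value (a back-of-the-envelope sketch, not a measurement):

```python
from math import log2

ENTROPY_PER_CHAR = 1.0  # assumed average entropy of English, bits per character

# A k-key interface conveys at most log2(k) bits per keystroke, so an ideal
# encoder needs at least ENTROPY_PER_CHAR / log2(k) keystrokes per character.
for keys in (2, 3, 27):
    bits_per_stroke = log2(keys)
    print(f"{keys} keys: {bits_per_stroke:.2f} bits/stroke, "
          f"{ENTROPY_PER_CHAR / bits_per_stroke:.2f} strokes/char at the limit")
```

Two keys give exactly one bit per stroke, which is why the break-even point sits right at the one-bit-per-character estimate; three keys give log2(3) ≈ 1.58 bits of headroom per stroke.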
Just thought I’d point that out, as it might be something worth thinking about.
Already done; see Dasher and especially its Google Tech Talk.
It doesn’t reach the 0.7-1 bit per character limit, of course, but then, according to the Hutter challenge no compression program (online or offline) has.
Wow, and Dasher was invented by David MacKay, author of the famous free textbook on information theory!
According to Google Books, the textbook mentions Dasher, too.
This is already exploited on cell phones to some extent.
SilasBarta:
One way to achieve this (though not practical for human-facing interfaces) would be to input the entire message bit by bit in some powerful lossless compression format optimized specifically for English text, and decompress it at the end of input. This way, you’d eliminate as much redundancy in your input as the compression algorithm is capable of removing.
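A rough sketch of that scheme, with an off-the-shelf compressor (Python’s bz2) standing in for the hypothetical English-optimized codec; on a message this short the container overhead dominates, so a fair estimate would use a large corpus:

```python
import bz2

message = ("A few posters have mentioned here that the average entropy of a "
           "character in English is about one bit, which suggests a two-key "
           "interface could match an ordinary keyboard stroke for stroke.")

# The sender keys in the compressed bitstream one bit at a time (two keys =
# one bit per keystroke); the receiver decompresses to recover the message.
compressed = bz2.compress(message.encode("utf-8"), 9)
assert bz2.decompress(compressed).decode("utf-8") == message

# On text this short the result will be far above 1 bit/char, because bzip2's
# framing and tables swamp the content; the point is only the mechanism.
keystrokes = len(compressed) * 8  # one keystroke per output bit
print(f"{keystrokes / len(message):.2f} keystrokes per character")
```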
The really interesting question, of course, is what are the limits of such technologies in practical applications. But if anyone has an original idea there, they’d likely cash in on it rather than post it here.
Shannon’s estimate of 0.6 to 1.3 bits per character was based on having humans guess the next character out of a 27-character alphabet: the 26 letters plus the space, with no other punctuation.
The impractical leading algorithm achieves 1.3 bits per byte on the first 10^8 bytes of Wikipedia. This page says that stripping Wikipedia down to a simple alphabet doesn’t affect compression ratios much; I think that means it hits Shannon’s upper estimate. But it’s not normal text (e.g., redirects), so I’m not sure in which way its entropy differs. The practical (for a computer, not a human) algorithm bzip2 achieves 2.3 bits per byte on Wikipedia, and I find it achieves 2.1 bits per character on normal text (which suggests that Wikipedia has more entropy, and thus that the leading algorithm is beating Shannon’s estimate).
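For what it’s worth, the bits-per-byte figures are just compressed size × 8 divided by original size; for example, the ~29% bzip2 ratio on the 10^8-byte file mentioned further down works out to the 2.3 bits per byte quoted here:

```python
def bits_per_byte(original_size: int, compressed_size: int) -> float:
    """Compression cost expressed as output bits per input byte."""
    return 8 * compressed_size / original_size

# bzip2 compressing the first 10^8 bytes of Wikipedia to roughly 29% of the
# original size (the ratio given later in this thread):
print(round(bits_per_byte(10**8, 29_000_000), 1))  # -> 2.3
```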
Since Sniffnoy asked about arithmetic coding: if I understand correctly, this page claims that arithmetic coding of characters achieves 4 bits per character and 2.8 bits per character if the alphabet is 4-tuples.
bzip2 is known to be both slow and not too great at compression; what does lzma-2 (faster & smaller) get you on Wikipedia?
(Also, I would expect redirects to play in a compression algorithm’s favor compared to natural language. A redirect almost always takes the stereotypical form #REDIRECT[[foo]] or #redirect[[foo]]. It would have difficulty compressing the target, frequently a proper name, but the other 13 characters? Pure gravy.)
Here are the numbers for a pre-LZMA2 version of 7zip. It looks like LZMA is 2.0 bits per byte, while some other option is 1.7 bits per byte.
Yes, I would expect Wikipedia to compress better than ordinary text, but it doesn’t seem to be so. This is just for the first 100MB. At a gig, all compression programs do dramatically better, even off-the-shelf ones that shouldn’t window that far. Maybe there is a lot of random vandalism early in the alphabet?
Well, early on there are many weirdly titled pages, and I could imagine that the first 100MB includes all the ‘1958 in British Tennis’-style year articles. But intuitively that doesn’t feel like enough to cause bad results.
Nor have any of the articles or theses I’ve read on vandalism detection noted any unusual distributions of vandalism. Further, obvious vandalism like gibberish or high-entropy strings is the least long-lived form of vandalism; long-lived vandalism looks plausible and correct, and is indistinguishable from normal English even to native speakers (much less a compression algorithm).
A window really does sound like the best explanation, until someone tries out 100MB chunks from other areas of Wikipedia and finds they compress comparably to 1GB.
bzip2’s window is only 900 kB (it compresses in independent blocks of that size), yet it compresses 100MB to 29% but 1GB to 25%. Increasing the memory on 7zip’s PPM makes a larger difference on 1GB than on 100MB (1GB: 18.5% → 17.8%; 100MB: 21.3% → 21.1%), so maybe the window is what’s relevant there, but that doesn’t seem very plausible to me.
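One way to see why the block size shouldn’t be the whole story: bzip2 already compresses in independent 900 kB blocks internally, so compressing a file in separate 900 kB chunks gives nearly the same total as compressing it whole. A rough sketch on synthetic stand-in data (not the actual Wikipedia dump), with long-range repetition that a large-window compressor could exploit but bzip2 cannot:

```python
import bz2

BLOCK = 900_000  # bzip2's block size at compression level 9

# Stand-in data with plenty of long-range repetition.
data = (b"the quick brown fox jumps over the lazy dog. " * 100_000)[:4 * BLOCK]

whole = len(bz2.compress(data, 9))
chunked = sum(len(bz2.compress(data[i:i + BLOCK], 9))
              for i in range(0, len(data), BLOCK))

# The totals differ only by per-stream framing overhead, so whatever makes
# 1GB compress better than 100MB for bzip2, it isn't a bigger window.
print(whole, chunked)
```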
Sporting lists might compress badly, especially if they contain times, but this one seems to compress well.
That’s very odd. If you ever find out what is going on here, I’d appreciate knowing.
Doesn’t arithmetic coding accomplish this? Or does that not count because it’s unlikely a human could actually use it?
I don’t think arithmetic coding by itself achieves the theoretical ~1 bit per character entropy of common English, as that requires a probability model capturing very complex structure in the distribution. If you know a color word is coming next, you can capitalize on it, but not with a letterwise model.
Of course, if you permit a large enough block size, then it could work, but the lookup table would probably be unmanageable.
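To make that concrete: an arithmetic coder is only as good as the probability model feeding it. With an order-0 model (single-character frequencies, no context) you land around the ~4 bits per character mentioned upthread, nowhere near 1. A minimal sketch on a short sample string (a real estimate would use a large corpus):

```python
from collections import Counter
from math import log2

sample = ("if you know a color word is coming next you can capitalize on it "
          "but a letter by letter frequency model cannot")

counts = Counter(sample)
total = len(sample)

# Shannon entropy of the order-0 model: the average bits per character an
# ideal coder (arithmetic coding included) needs if it only knows frequencies.
h0 = -sum((c / total) * log2(c / total) for c in counts.values())
print(f"order-0 entropy: {h0:.2f} bits/character")  # roughly 4 on this sample
```

Getting from ~4 down toward ~1 is entirely a matter of a better model (word- and context-level prediction), which is what the block-size and lookup-table problem is really about.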
Yeah, I meant “arithmetic encoding with absurdly large block size”; I don’t have a practical solution.