I am working on large scale lossless data compression, specifically text compression. I chose this problem after a long and roundabout process of philosophical-statistical reflection.
My goal is to demonstrate what I call the “reusability hypothesis”, which states that the tools and theories useful for compressing a database are also useful for all other questions of interest regarding the database. For example, in relation to text, the hypothesis suggests that knowledge of POS tagging, parsing, and topic recognition (typical problems in SNLP) is also useful for compression.
If this hypothesis is true, it has important implications. Because the compression problem “includes” many other problems, researchers can simply focus their efforts on compression, and they will solve the other problems of interest along the way. This means, among other things, that there is no point in building labeled datasets such as the Berkeley Segmentation Dataset or the Penn Treebank.
How much overlap does the “reusability hypothesis” have with the “minimal description length principle”?
Perhaps a better question is: how much overlap does RH have with “empirical science”? My answer is: a lot. By doing empirical science—that is, open-ended, curiosity-motivated research whose goal is to obtain a single optimal description of a particular phenomenon—it just so happens that one also finds theories of direct relevance for practical applications. The same theory that describes the motion of the planets and the velocities of balls rolling down inclined planes is also useful for building bridges (reusability). All mainstream empirical scientists recognize this; no one in computer vision or computational linguistics does.
The link to MDL is that to do lossless compression, one must do empirical science. This follows from a counting (no-free-lunch) argument: no lossless compressor can shrink all possible inputs, so compression of arbitrary data is impossible. The only way to achieve compression is to discover empirical regularities in the data and then exploit those regularities. For example, the PNG image format exploits the obvious regularity that the values of adjacent image pixels tend to be correlated.
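The pixel-correlation point can be made concrete with a toy sketch. This is illustrative only (real PNG applies per-row byte filters such as Sub/Up/Paeth and then DEFLATE, and the "scanline" here is invented), but it shows why predicting each value from its neighbour helps a generic entropy coder:

```python
# The regularity PNG exploits: adjacent pixels are correlated, so encoding
# each pixel as a *difference* from its left neighbour (roughly PNG's "Sub"
# filter) concentrates values near zero, which compresses better.
from collections import Counter
from math import log2

def entropy_bits(values):
    """Empirical Shannon entropy (bits/symbol) of a value sequence."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

# A smooth made-up 1-D "scanline": neighbouring values are close together.
row = [10, 11, 13, 14, 14, 15, 17, 18, 18, 19, 21, 22]

# Delta filter: keep the first value, replace the rest by differences.
deltas = [row[0]] + [b - a for a, b in zip(row, row[1:])]

print(entropy_bits(row))     # many distinct values: high entropy
print(entropy_bits(deltas))  # residuals cluster around 0..2: lower entropy
```

The raw row has ten distinct values, while the residuals repeat a handful of small numbers, so their empirical entropy is lower; that gap is exactly the regularity a filter-plus-entropy-coder pipeline cashes in.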
So MDL implies empirical science which implies RH.
Did you reverse this? e.g. compression is useful for POS tagging, etc.
No, the original is correct. Let’s say you want to compress a text corpus. Then you need a good probability model P(s), the probability of a sentence s, that assigns high probability to the sentences in the corpus. Since most sentences follow grammatical rules, the model P(s) will need to reflect those rules in some way. The better you understand the rules, the better your model will be, and the shorter the resulting codelength. So because parsing (and POS tagging, etc.) is useful for compression, the compression principle allows you to evaluate your understanding of parsing (etc.).
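The model-to-codelength link rests on Shannon’s result that an optimal code spends about -log2 P(x) bits on x. A minimal sketch, using a character-level model as a stand-in for P(s) and an invented miniature corpus (both are assumptions for illustration, not part of the original argument):

```python
# "Better model -> shorter codelength": under model P, an ideal code
# spends -log2 P(c) bits per symbol, so total codelength is the sum
# of those terms over the corpus.
from collections import Counter
from math import log2

corpus = "the cat sat on the mat and the cat ran"

def codelength_bits(text, prob):
    """Ideal codelength of `text` under per-character model `prob`."""
    return sum(-log2(prob[c]) for c in text)

# Model A: uniform over the characters that occur (knows no regularity).
alphabet = set(corpus)
uniform = {c: 1 / len(alphabet) for c in alphabet}

# Model B: fitted character frequencies (knows one regularity of the data).
counts = Counter(corpus)
fitted = {c: n / len(corpus) for c, n in counts.items()}

print(codelength_bits(corpus, uniform))  # longer code
print(codelength_bits(corpus, fitted))   # shorter code
```

The same comparison scales up: a model that additionally captured word frequencies, and then grammar, would shrink the codelength further, which is exactly the evaluation the compression principle provides.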
If your project works, then, it will reduce science to a problem in NP?
That’s what other compression advocates want to do. In fact, it’s pretty obvious that this is true—if you found an algorithm capable of computing the Kolmogorov complexity of any input string, you would pretty much put all theoretical scientists out of work. Unfortunately or not, this problem is uncomputable. Some people (Hutter et al) think, however, that it is possible to find good general purpose algorithms for computing approximations to the K-complexity.
I disagree—I think such algorithms are far out of reach. To me, the compression idea is interesting because it provides a new way to evaluate one’s understanding of various phenomena. I want to study text compression because I am interested in text, not because I think it will lead to some grand all-purpose algorithm. This is in the spirit of The Virtue of Narrowness.
About the NLP stuff:
You seem to be speculating that a solution to parsing is going to pop out of understanding compression. Your argument is that parsing is essentially a problem of modelling P(s), and compression is just that. But I’m skeptical that the compression angle is going to give you special insight. People have been keenly interested in language models since approximately forever, since they’re critical for translation and other applications. They’re pretty much the bedrock of the field.
I’d note though that text compression would be a useful evaluation technique for language technologies. There are lots of competing encoding schemes for grammars, and they are difficult to compare, which makes the tools built on them difficult to compare as well. “Which one compresses text best” seems like a very good solution. I don’t know whether it’s been proposed before. It sounds familiar, but that might just be because good ideas have that sort of resonance sometimes.
Would you say that compression ‘includes’ many other problems or that compression and those problems (I assume you mean things like statistics etc.) are isomorphic (or partially anyway)? Why do you recommend focusing on compression and applying compression techniques to other fields instead of vice versa or both at once? Is there a reason you see compression as more fundamental?
I would say that if you understand phenomenon X, you should be able to build a specialized compressor for phenomenon X that will achieve a short codelength. The study of compression algorithms for datasets produced by phenomenon X is effectively identical to the study of phenomenon X. Furthermore compression provides a strong methodological advantage: there can be no hand-waving; it is impossible to “cheat” or self-deceive.
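The “impossible to cheat” claim is usually made precise with a two-part (MDL) code: total cost is bits for the model plus bits for the data given the model. A numeric sketch under made-up cost assumptions (8 bits per literal character and an invented 16-bit header; the point is the comparison, not the constants):

```python
# Why compression resists cheating: a "model" that memorizes the data
# makes the data-given-model term zero but pays full price for itself.
from math import log2

data = "abababababababababab"  # 20 chars with an obvious regularity

# Honest model: "repeat 'ab' to length n". Costs the 2-char pattern
# (8 bits/char), an assumed 16-bit header, and log2(n) bits for the length.
honest = 2 * 8 + 16 + log2(len(data))

# Cheating model: a verbatim copy of the data embedded in the model.
# Data given the model is free, but the model costs 8 bits/char.
cheat = len(data) * 8 + 0

print(honest, cheat)  # the honest model wins by a wide margin
```

Hand-waving in a paper goes unpunished; hand-waving in a two-part code shows up directly as extra bits in the model term, which is the methodological advantage claimed above.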
Compression is cognition. Gerry Wolff has also been saying this for many years. I haven’t studied his work closely enough or recently enough to comment on it.
Lots of people promote the compression idea. But as far as I know, with the exception of the Hutter Prize, no one has actually attempted to do large scale compression, in any kind of systematic way. People in SNLP talk about language modeling, but they are not systematic (no shared benchmarks, no competitions, etc). No one in computer vision attempts to do large scale image compression.
I just spent an enjoyable lunch break reading about the Hutter Prize. Do you know if the paper linked (now brokenly) from a few places in its FAQ, the one that seems to disparage Occam’s Razor, is still online anywhere? A mailing list entry is the most complete summary of it I could find.
Progress in compression is substantially similar to progress in AI.
http://prize.hutter1.net/