This is kind of a strange post, but I suppose I might as well try.
I am working on comperical linguistics, which means approaching the problems of linguistics by working on large-scale, lossless compression of text data. The rationale for this research is laid out in my book. I’ve made nice progress over the last year or two, but there is still a lot of work to be done. Currently the day-to-day work is about 60% software engineering, 30% linguistic analysis (random recent research observation: there is an important difference between relative clauses where the headword noun functions as the subject vs. the object of the clause; in one case it is permissible to drop the Wh-word complementizer, in the other it isn’t: compare “the book (that) I read”, where “that” can be dropped, with “the author that wrote the book”, where it can’t) and 10% ML/stats. I am hopeful that the stats/math component will increase in the near term and the SE work will decrease, as I’ve already done a lot of the groundwork. If this kind of research sounds interesting to you, let me know and maybe we can figure out a way to collaborate.
More broadly, I would like to encourage people to apply the comperical philosophy to other research domains. The main requirement is that it be possible to obtain a large amount of rich, structured (but not labelled) data. For example, I would love to see someone try to compress the Sloan Digital Sky Survey database. Compressors for this data set will need to have a deep “understanding” of physics and cosmology. The compression principle might be a way to resolve debates about topics like dark matter and black holes. It could also focus attention on segments of the astronomical data that defy conventional explanation (this would be detected as a spike in the codelength for a region). Such abnormalities might indicate the presence of alien activity or new types of astrophysical phenomena.
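To make the “spike in the codelength” idea concrete, here is a minimal sketch of how it might work. Nothing here is tied to any real SDSS pipeline; the per-observation probabilities, the region size, and the threshold are all hypothetical placeholders:

import numpy as np

def codelength_bits(probabilities):
    # Ideal codelength of a sequence of observations, in bits,
    # given the model's predicted probability for each observation.
    return -np.sum(np.log2(probabilities))

def flag_anomalous_regions(per_obs_probs, region_size=1000, z_threshold=3.0):
    # Split the data stream into fixed-size regions, compute the bits
    # each region costs under the model, and flag regions whose cost
    # is far above the mean (a "spike" in codelength).
    n_regions = len(per_obs_probs) // region_size
    bits = np.array([
        codelength_bits(per_obs_probs[i * region_size:(i + 1) * region_size])
        for i in range(n_regions)
    ])
    mean, std = bits.mean(), bits.std()
    return [i for i in range(n_regions) if bits[i] > mean + z_threshold * std]

The flagged regions are exactly the segments the model explains poorly, which is where one would go looking for new phenomena.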
If you think that this method could be useful for astronomy in the future, can you point to astronomical controversies in the past that it would have helped with?
You can’t resolve debates unless there are actual competing theories. The main problem with dark matter is that people don’t have theories precise enough to compare. When there are competing theories, astronomy seems to me to be pretty good at isolating the relevant data and comparing the theories.
And I imagine that is why you are working on linguistics, not astronomy. Astronomy has a solid grounding that allows it to process and isolate data, while a lot of linguistic claims require agreement on labeling before they can be assessed, making it easy for people to hide behind subjective labeling.
I don’t follow… Can you elaborate on how some specific form of compression could do that?
I can’t explain the whole philosophy here, but basically the idea is: you have two theories, A and B. You instantiate them as lossless data compressors, and invoke the compressors on the dataset. The one that produces a shorter net codelength (including the length of the compressor program itself) is superior. In practice the rival theories will probably be very similar and produce different predictions (= probability distributions over observational outcomes) only on small regions of the dataset.
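As a toy illustration of the comparison procedure, here is a sketch that uses off-the-shelf general-purpose compressors as stand-ins for the two theories. In the real proposal each compressor would encode a substantive scientific model, the dataset file name is hypothetical, and the program-length terms below are rough placeholders rather than measured values:

import bz2, lzma

def net_codelength(compress_fn, program_size_bytes, data):
    # Net codelength of a "theory": compressed data size plus the size
    # of the compressor program itself, in bytes.
    return len(compress_fn(data)) + program_size_bytes

with open("dataset.bin", "rb") as f:   # hypothetical dataset file
    data = f.read()

# Stand-ins for two rival theories, A and B. In practice you would count
# the bytes of the actual compressor implementations.
theory_a = net_codelength(bz2.compress, 50_000, data)
theory_b = net_codelength(lzma.compress, 80_000, data)

print("Theory A net codelength:", theory_a)
print("Theory B net codelength:", theory_b)
print("Better theory:", "A" if theory_a < theory_b else "B")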
Lossless data compression is a highly rigorous evaluation principle. Many theories are simply not well-specified enough to be built into compressors; these theories, I say (reformulating Popper and Yudkowsky), should not be considered scientific. If the compressor implementation contains any bugs, they will immediately show up when the decoded data fails to agree exactly with the original data. Finally, if the theory is scientific and the implementation is correct, it still remains to be seen whether the theory is empirically accurate; by the domain’s No Free Lunch theorem, no compressor can shorten all possible inputs, so achieving short codelengths on real data requires a theory that actually captures the regularities in that data.
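The bug-detection property is just a round-trip check. A sketch, assuming hypothetical encode and decode functions for the theory under test:

def verify_lossless(encode, decode, data):
    # Any bug in the model or the coder shows up as a mismatch between
    # the decoded output and the original data.
    code = encode(data)
    assert decode(code) == data, "decoded output differs from original"
    return len(code)  # codelength, meaningful only if the assertion passes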
So say you and I have two rival theories of black hole dynamics. If the theories are different in a scientifically meaningful way, they must make different predictions about some data that could be observed. That means the compressors corresponding to our theories will assign different codelengths to some observations in the dataset. If your theory is more accurate, it will achieve shorter codelengths overall. This could happen by, say, your theory properly accounting for the velocity dispersion of galaxies under the effect of dark matter. Or it could happen by my theory being hit by a big Black Swan penalty because it cannot explain an astronomical jet coming from a black hole.
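The “Black Swan penalty” is just the information-theoretic cost of assigning low probability to something that actually happens: an observation given model probability p costs -log2(p) bits. A quick illustration with made-up probabilities:

import math

def penalty_bits(p):
    # Codelength, in bits, that a model pays for an observation
    # it assigned probability p.
    return -math.log2(p)

# Hypothetical numbers: my theory considers a relativistic jet from a
# black hole nearly impossible; yours assigns it a modest probability.
print(penalty_bits(1e-12))   # my theory:   ~39.9 bits for this one observation
print(penalty_bits(0.01))    # your theory: ~6.6 bits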
What about the fact that the best compression algorithm may be insanely expensive to run? We know the math that describes the behavior of quarks, which is to say, we can in principle generate the results of all possible experiments with quarks by solving a few equations. However, doing computations with the theory is extremely expensive: it takes something like 10^15 floating point operations to compute, say, some basic properties of the proton to 1% accuracy.
Good point. My answer is: yes, we have to accept a speed/accuracy tradeoff. That doesn’t seem like such a disaster in practice.
Some people, primarily Matt Mahoney, have actually organized data compression contests similar to what I’m advocating. Mahoney’s solution is just to impose a certain time limit that is reasonable but arbitrary. In the future, researchers could develop a spectrum of theories, each of which achieves a non-dominated position on a speed/compression curve. Unless something Very Strange happened, each faster/less accurate theory would be related to its slower/more accurate cousin by a standard suite of approximations. (It would be strange—but interesting—if you could get an accurate and fast theory by doing a nonstandard approximation or introducing some kind of new concept).
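A small sketch of what a “non-dominated position on a speed/compression curve” means, with invented (runtime, codelength) numbers for a few hypothetical theory-compressors:

# Each entry: (name, runtime in seconds, net codelength in bits). All values invented.
theories = [
    ("fast-approx", 10, 9.0e9),
    ("mid", 100, 8.2e9),
    ("slow-exact", 5000, 7.9e9),
    ("bad", 2000, 9.5e9),   # dominated: slower AND longer codelength than "mid"
]

def pareto_front(points):
    # Keep the theories not dominated by any rival, i.e. no other theory is
    # at least as fast and at least as compressive, and strictly better on
    # one of the two axes.
    front = []
    for name, t, bits in points:
        dominated = any(
            t2 <= t and b2 <= bits and (t2 < t or b2 < bits)
            for n2, t2, b2 in points if n2 != name
        )
        if not dominated:
            front.append((name, t, bits))
    return front

print(pareto_front(theories))
# keeps "fast-approx", "mid", and "slow-exact"; drops "bad"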
Thank you, this is almost what I meant. Could you add some details on why you consider this question important over perhaps some others?