I interpret Daniel_Burfoot’s idea as: “import java.util.*” makes subsequent mentions of List longer, since there are more symbols in scope that it has to be distinguished from.
But I don’t think that idea actually works. You can decompose the probability of a conjunction into a product of conditional probabilities, and you get the same number regardless of the order of said decomposition. Whatever probability (and corresponding total compressed size) you assign to a certain sequence of imports and symbols, you could just as well record the symbols first and then the imports. In which case by the time you get to the imports you already know that the program didn’t invoke any utils other than List and Map, so even being able to represent the distinction between the two forms of import is counterproductive for compression.
The intuition of “there are more symbols in scope that it has to be distinguished from” goes wrong because it fails to account for updating a probability distribution over what symbol comes next based on what symbols you’ve seen. Such an update can include knowledge of which symbols come in the same header, if that’s correlated with which symbols are likely to appear in the same program.
You’re totally right about the fact that the compressor could change the order of the raw text in the encoding format, and could infer the import list from the main code body, and encode class name references based on the number of previous occurrences of the class name in the code body. It’s not clear to me a priori that this will actually give better compression in practice, but it’s possible.
But even if that’s true, the main point still holds: limiting the number of external classes you use allows the compressor to same bits.
I interpret Daniel_Burfoot’s idea as: “import java.util.*” makes subsequent mentions of List longer, since there are more symbols in scope that it has to be distinguished from.
But I don’t think that idea actually works. You can decompose the probability of a conjunction into a product of conditional probabilities, and you get the same number regardless of the order of said decomposition. Whatever probability (and corresponding total compressed size) you assign to a certain sequence of imports and symbols, you could just as well record the symbols first and then the imports. In which case by the time you get to the imports you already know that the program didn’t invoke any utils other than List and Map, so even being able to represent the distinction between the two forms of import is counterproductive for compression.
The intuition of “there are more symbols in scope that it has to be distinguished from” goes wrong because it fails to account for updating a probability distribution over what symbol comes next based on what symbols you’ve seen. Such an update can include knowledge of which symbols come in the same header, if that’s correlated with which symbols are likely to appear in the same program.
You’re totally right about the fact that the compressor could change the order of the raw text in the encoding format, and could infer the import list from the main code body, and encode class name references based on the number of previous occurrences of the class name in the code body. It’s not clear to me a priori that this will actually give better compression in practice, but it’s possible.
But even if that’s true, the main point still holds: limiting the number of external classes you use allows the compressor to same bits.