Here’s a fun game:
Define a set of software specifications. For example, you could specify that the software needs to correctly implement a board game.
Players submit implementations that must meet the specifications; a submission that does not meet the specification is considered incorrect and automatically disqualified.
Submissions that correctly implement the specification are compressed. The winner is the implementation with the shortest compressed file size.
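Concretely, the scoring step could start out as simple as gzipping each passing submission and ranking by byte count. A minimal Java sketch of such a scorer (the command-line harness is just one way to set it up):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.zip.GZIPOutputStream;

    // Minimal sketch: score a submission by its gzip-compressed size.
    // A real contest would swap in the language-specific compressor
    // discussed below; gzip is just the obvious baseline.
    public class CompressionScore {
        static long score(Path submission) throws IOException {
            byte[] source = Files.readAllBytes(submission);
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
                gz.write(source);
            }
            return sink.size(); // lower is better
        }

        public static void main(String[] args) throws IOException {
            System.out.println(score(Path.of(args[0])) + " bytes");
        }
    }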
I think this game would be fun, but to make it really interesting you would want to use language-specific compressors instead of gzip. For example, you would want the Java-specific compressor to know about things like abstract classes, public/private access modifiers, and other aspects of Java syntax. So the compressor would actually be closely related to the compiler.
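As a toy illustration of what knowing the syntax buys you (my own sketch, not a real compressor): charge each token log2(codebook size) bits, with a small fixed codebook for keywords and a symbol table that grows as identifiers appear:

    import java.util.ArrayList;
    import java.util.List;

    // Toy sketch of a syntax-aware cost model. It charges each token
    // log2(codebook size) bits, with separate codebooks for keywords and
    // for identifiers seen so far. A real coder would also have to transmit
    // the spelling of each new identifier; this sketch ignores that.
    public class TokenCostModel {
        private static final List<String> KEYWORDS = List.of(
            "public", "private", "protected", "abstract", "class",
            "void", "int", "return", "new", "static");
        private final List<String> identifiers = new ArrayList<>();

        double bitsFor(String token) {
            if (KEYWORDS.contains(token)) {
                return log2(KEYWORDS.size());    // fixed keyword codebook
            }
            if (!identifiers.contains(token)) {
                identifiers.add(token);          // first use grows the table
            }
            return log2(identifiers.size());     // log2(N) bits per reference
        }

        private static double log2(double x) { return Math.log(x) / Math.log(2); }

        public static void main(String[] args) {
            TokenCostModel model = new TokenCostModel();
            double total = 0;
            for (String t : "public class Board private int size".split(" ")) {
                total += model.bitsFor(t);
            }
            System.out.printf("estimated cost: %.1f bits%n", total);
        }
    }

The fewer names the model has to keep in its tables at any point, the cheaper every reference becomes, which is the intuition behind the practices listed further down the thread.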
I think this kind of quantitative approach would correspond closely to a lot of our intuitions about what good software looks like. For example, the compression principle would prompt people to make most methods private, because that would allow the compressor to save bits when encoding object method calls. It would also promote code reuse for obvious reasons.
Code golfing is basically this, without the compression. You can see some examples here. The results match our intuitions about good code very badly. Even with language-specific compression, which would presumably allow comments and reasonable variable names without any loss, I would still expect the optimal code to be bad from a conventional point of view.
What bad coding habits do you think trying to optimize for compressibility would promote? I can think of about 10 good habits it would promote, the most important being Don’t Repeat Yourself.
Obviously you need to avoid penalizing good variable names and comments.
What bad coding habits do you think trying to optimize for compressibility would promote?

Being easily understandable by humans.
Knuth: “The best programs are written so that computing machines can perform them quickly and so that human beings can understand them clearly. A programmer is ideally an essayist who works with traditional aesthetic and literary forms as well as mathematical concepts, to communicate the way that an algorithm works and to convince a reader that the results will be correct.”
I’m thinking in particular of two bad features:
Bad interfaces: Functions need only work on the inputs that are actually given to them in practice. They may behave unexpectedly on valid inputs that never occur in practice, and they definitely won’t give useful error messages for invalid input. They may have ad-hoc alterations to their behavior because those alterations happen to be useful for the particular domain.
Bad idioms: There is no disincentive to use unfamiliar features and unconventional patterns that will confuse anybody trying to read the code. The canonical way to do something is often not the most efficient one.
You can see these features in code golfing.
More abstractly, good code is defined by a broad range of criteria, and optimizing for only one feature is likely to do less well than intentionally setting out to write good code.
On the other hand, STEPS is a project designed to build a working personal computer in as few lines of code as possible. According to their report, making the code shorter also resulted in a simpler design. This is evidence in favor of your conclusion.
The point about good error messages seems right; I would deal with this by saying that the specification should require that the code give meaningful error messages when called on invalid input. I’m not sure what you mean by ad-hoc alterations; it seems to me that the compression principle discourages the use of ad-hoc alterations and that’s exactly the point.
Here are some good cases where the compression principle would agree with standard good coding practices:
Don’t Repeat Yourself
Use the least public access modifiers possible (e.g. prefer protected to public and private to protected in Java)
Avoid allowing lots of variables to hang around in scope; if you find yourself with more than 5 or 6 variables in a method, pull some of them out into another method.
Avoid string and integer literals, especially repeated ones; pull them out into named constants.
Avoid “old school” for loops like for (int i = 0; i < 10; i++); prefer for (int i : range(10)) (assuming a small range helper, since Java has no built-in one)
Prefer object variables to static variables, and method-scope variables to object variables
Use the weakest possible assumptions about method argument type: don’t define a method to take a TreeSet argument if it only needs to know that the argument is a Collection
Import only the external libraries you need. Don’t import java.util.*; use import java.util.List and java.util.Map
Reduce dependence on third party libraries
Use mixed-type method signatures. If a method takes String, Integer, and FooBar arguments, and you happen to have a String, an Integer, and a FooBar in scope when you call it, the compressor can infer the order in which the arguments are placed. If the method takes four String arguments, you are out of luck. (This item and the Collection item above are sketched just below.)
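Here is a small illustration of the Collection item and the mixed-signature item; FooBar and the method names are invented for the example:

    import java.util.Collection;
    import java.util.TreeSet;

    public class SignatureExamples {
        // Weakest assumption about the argument type: any Collection works,
        // so callers can pass a TreeSet, an ArrayList, and so on.
        static int countLongWords(Collection<String> words) {
            return (int) words.stream().filter(w -> w.length() > 7).count();
        }

        // Mixed-type signature: with exactly one String, one Integer, and
        // one FooBar in scope at the call site, the argument order is fully
        // determined by the types, so an ideal compressor spends ~0 bits on it.
        static void register(String name, Integer age, FooBar payload) { }

        static class FooBar { }

        public static void main(String[] args) {
            TreeSet<String> set = new TreeSet<>();
            set.add("compressibility");
            System.out.println(countLongWords(set)); // works with any Collection
            register("alice", 30, new FooBar());
        }
    }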
I am curious about the compression algorithm that would compress “import java.util.List; import java.util.Map” to a shorter string than “import java.util.*;”.
Unless you would artificially make it compress the latter very inefficiently—but, of course, using that kind of strategy you could “prove” almost anything.
It’s not the import statements that get compressed; it’s the code beneath the imports that gets better compression because you limited the number of classes you imported. The compressor can compress references to class names to log(N) bits, where N is the number of class names that are in scope/imported.
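A quick illustrative calculation (the 64 is a made-up round number, and this ignores java.lang and locally defined names):

    import java.util.List; import java.util.Map;
        N = 2 names in scope, so each class-name reference costs log2(2) = 1 bit
    import java.util.*;
        if the wildcard pulls in, say, 64 names, each reference costs log2(64) = 6 bits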
I interpret Daniel_Burfoot’s idea as: “import java.util.*” makes subsequent mentions of List longer, since there are more symbols in scope that it has to be distinguished from.
But I don’t think that idea actually works. You can decompose the probability of a conjunction into a product of conditional probabilities, and you get the same number regardless of the order of said decomposition. Whatever probability (and corresponding total compressed size) you assign to a certain sequence of imports and symbols, you could just as well record the symbols first and then the imports. In which case by the time you get to the imports you already know that the program didn’t invoke any utils other than List and Map, so even being able to represent the distinction between the two forms of import is counterproductive for compression.
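In code-length terms this is just the chain rule: however you factor the joint distribution, the total compressed size is the same.

    L(imports, body) = -log2 P(imports) - log2 P(body | imports)
                     = -log2 P(body) - log2 P(imports | body)

A scheme that transmits the body first gets the import list nearly for free, so bits reserved for distinguishing the two import styles are pure overhead.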
The intuition of “there are more symbols in scope that it has to be distinguished from” goes wrong because it fails to account for updating a probability distribution over what symbol comes next based on what symbols you’ve seen. Such an update can include knowledge of which symbols come in the same header, if that’s correlated with which symbols are likely to appear in the same program.
You’re totally right that the compressor could reorder the raw text in the encoding format, infer the import list from the main code body, and encode class-name references based on the number of previous occurrences of each class name in the code body. It’s not clear to me a priori that this would actually give better compression in practice, but it’s possible.
But even if that’s true, the main point still holds: limiting the number of external classes you use allows the compressor to save bits.
Well, only three of your practices have anything to do with compression.
I believe his imagined compression scheme compresses variables so that when there are N variables in scope, it takes log(N) bits to refer to a variable, and similarly for methods.
If the source code compressor is well-written (and specialized to the language in question), then the programmer can achieve compression by using the practices mentioned above. itaibn0 has the right idea.
Code golfing encourages factoring out coincidental (as opposed to fundamental) redundancy, which is bad for understandability/maintainability. For example, there are C golf programs that read (coincidentally useful) characters from their own source to avoid having to include string literals. Basically, golfed code tends to have fragile interdependence between parts (high coupling), which makes it hard to extend.
Usually, with contests like this, you have to include the size of the decompression software in the program itself. Basically, you make an executable file that decompresses itself into the game. For example, .kkrieger is a first-person shooter that fits in a 96 kB executable. You simply download the game and run it.
If the compressor is neutral (gzip) or language-specific but not task-specific (as Daniel_Burfoot proposes) you should be able to ignore it.