“Likelihood” is ambiguous: [...] This Bayes net would have a low probability under a prior belief distribution over Bayes nets, but would assign a high likelihood to the data.
Right—this would be what I’d call “cheating” or overfitting the data. We’d have to use the compression rate in this case.
To define compression unambiguously, you should agree on a programming language or executable format, and on runtime and memory bounds on a reference computer.
Sure. I’ll work out the technical details if anyone wants to enter the contest. I would prefer to use the most recent stable JVM. It seems very unlikely to me that the outcome of the contest will depend on the precise selection of time or memory bounds: let’s say the time bound is O(24 hours) and the memory bound is O(2 GB).
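For concreteness, here is one way such bounds might be enforced on the JVM side; this is only a sketch, and the entry and file names are placeholders rather than contest rules. Each entry runs as a child JVM with a capped heap and is killed past the deadline. Note that -Xmx caps only the Java heap, not the total process memory.

    import java.util.concurrent.TimeUnit;

    /**
     * Hedged sketch of enforcing the proposed bounds: run an entry as a
     * child JVM with a 2 GB heap cap and kill it after 24 hours.
     * "entry.jar", "data.bin" and "data.cmp" are hypothetical names.
     */
    public class RunEntry {
        public static void main(String[] args) throws Exception {
            Process p = new ProcessBuilder(
                    "java", "-Xmx2g", "-jar", "entry.jar", "data.bin", "data.cmp")
                    .inheritIO()
                    .start();
            if (!p.waitFor(24, TimeUnit.HOURS)) {
                p.destroyForcibly();  // over the time bound
                System.err.println("Entry exceeded the 24-hour limit.");
            } else {
                System.out.println("Exit code: " + p.exitValue());
            }
        }
    }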
An alternative to a test of compression
It’s actually not very difficult to implement a compression program using arithmetic coding once you have the statistical model. Other prediction evaluation schemes may work, but compression has methodological crispness: you look at the compressed file size and check that the decompressed data matches the original exactly.
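To make the connection concrete, here is a minimal sketch (not the contest harness): any model that assigns a probability to each successive byte determines a compressed size, namely the sum of -log2 p over the data, which is what an arithmetic coder driven by that model would achieve to within a couple of bits. The adaptive order-0 byte model below is only a stand-in; a belief-network predictor would plug in through the same interface.

    /**
     * Sketch: the ideal compressed size implied by a predictive model,
     * i.e. the length an arithmetic coder driven by it would produce
     * (to within a couple of bits of overhead).
     */
    public class CodeLengthSketch {

        /** Minimal predictive-model interface: predict, then observe. */
        interface ByteModel {
            double probabilityOf(int nextByte); // P(next byte | history so far)
            void observe(int actualByte);       // update internal state
        }

        /** Placeholder model: Laplace-smoothed byte frequencies. */
        static final class Order0Model implements ByteModel {
            private final long[] counts = new long[256];
            private long total = 0;

            public double probabilityOf(int b) {
                return (counts[b] + 1.0) / (total + 256.0); // add-one smoothing
            }
            public void observe(int b) {
                counts[b]++;
                total++;
            }
        }

        /** Ideal code length in bits: sum of -log2 P(byte) over the data. */
        static double idealCodeLengthBits(byte[] data, ByteModel model) {
            double bits = 0.0;
            for (byte signed : data) {
                int b = signed & 0xFF;
                bits += -Math.log(model.probabilityOf(b)) / Math.log(2);
                model.observe(b);
            }
            return bits;
        }

        public static void main(String[] args) {
            byte[] sample = "the quick brown fox jumps over the lazy dog".getBytes();
            double bits = idealCodeLengthBits(sample, new Order0Model());
            System.out.printf("raw: %d bits, model code length: %.1f bits%n",
                    sample.length * 8, bits);
        }
    }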
Would it be in the spirit of the challenge for the Bayes Net contestant to use these more general models? Would it be out of the spirit of the challenge for the data to be about such a collection of objects?
Basically, when I say “belief networks”, what I mean is the use of graphs to define probability distributions and conditional independence relationships.
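As a toy illustration of that definition (the numbers are made up for illustration, not taken from any contest data), the classic rain/sprinkler/grass-wet graph defines the joint distribution P(R, S, W) = P(R) · P(S | R) · P(W | R, S), and any marginal follows by summing out variables:

    /**
     * Toy belief network with edges Rain -> Sprinkler and
     * {Rain, Sprinkler} -> GrassWet. The graph structure dictates the
     * factorization P(R, S, W) = P(R) * P(S | R) * P(W | R, S).
     * All probabilities are illustrative placeholders.
     */
    public class ToyBayesNet {
        // P(Rain = true)
        static final double P_RAIN = 0.2;
        // P(Sprinkler = true | Rain)
        static double pSprinkler(boolean rain) { return rain ? 0.01 : 0.40; }
        // P(GrassWet = true | Rain, Sprinkler)
        static double pWet(boolean rain, boolean sprinkler) {
            if (rain && sprinkler) return 0.99;
            if (rain)              return 0.80;
            if (sprinkler)         return 0.90;
            return 0.0;
        }

        /** Joint probability, read directly off the graph's factorization. */
        static double joint(boolean rain, boolean sprinkler, boolean wet) {
            double pr = rain ? P_RAIN : 1 - P_RAIN;
            double ps = sprinkler ? pSprinkler(rain) : 1 - pSprinkler(rain);
            double pw = wet ? pWet(rain, sprinkler) : 1 - pWet(rain, sprinkler);
            return pr * ps * pw;
        }

        public static void main(String[] args) {
            // Marginal P(GrassWet = true) by summing out the parents.
            double pWetTrue = 0.0;
            for (boolean r : new boolean[]{false, true})
                for (boolean s : new boolean[]{false, true})
                    pWetTrue += joint(r, s, true);
            System.out.printf("P(GrassWet) = %.4f%n", pWetTrue);
        }
    }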
The spirit of the contest is to use a truly “natural” data set. I admit that this is a bit vague. Really my only requirement is to use a non-synthetic data set. I think I know where you’re going with the “causally dependent” line of thinking, but it doesn’t bother me too much. I get the feeling that I am walking into a trap, but really I’ve been planning to make a donation to SIAI anyway, so I don’t mind losing.