Two Challenges
Followup To: Play for a Cause, Singularity Institute $100k Challenge Grant
In the spirit of informal intellectual inquiry and friendly wagering, and with an eye toward raising a bit of money for SIAI, I offer the following two challenges to the LW community.
Challenge #1 - Bayes’ Nets Skeptics’ Challenge
Many LWers seem to be strong believers in the family of modeling methods variously called Bayes’ Nets, belief networks, or graphical models. These methods are the topic of two SIAI-recommended books by Judea Pearl: “Probabilistic Reasoning in Intelligent Systems” and “Causality: Models, Reasoning, and Inference”.
The belief network paradigm has several attractive conceptual features. One feature is the ability of the networks to encode conditional independence relationships, which are intuitively natural and therefore attractive to humans. Often a naïve investigation of the statistical relationship between variables will produce nonsensical conclusions, and the idea of conditional independence can sometimes be used to unravel the mystery. A good example would be a data set relating to traffic accidents, which shows that red cars are more likely to be involved in accidents. But it’s nearly absurd to believe that red cars are intrinsically more dangerous. Rather, red cars are preferred by young men, who tend to be reckless drivers. So the color of a car is not independent of the likelihood of a collision, but it is conditionally independent given the age and sex of the person driving the car. This relationship could be expressed by the following belief network:
Here “YM” is true if the driver is a young man, “R” denotes whether the car is red, and “A” indicates an accident. The fact that there is no edge between “R” and “A” indicates that they are conditionally independent given the other nodes in the network, in this case “YM”.
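The red-car example can be checked numerically. The following is a minimal Python sketch, assuming a network in which YM is the only parent of both R and A. All of the probabilities are invented purely for illustration; they are not taken from any accident data.

```python
from itertools import product

# Hypothetical numbers, chosen only to illustrate the structure YM -> R, YM -> A.
P_YM = {True: 0.2, False: 0.8}           # P(YM)
P_R_GIVEN_YM = {True: 0.5, False: 0.1}   # P(R = True | YM)
P_A_GIVEN_YM = {True: 0.3, False: 0.05}  # P(A = True | YM)


def bern(p_true, value):
    """Probability of a boolean value under a Bernoulli(p_true) distribution."""
    return p_true if value else 1.0 - p_true


def joint(ym, r, a):
    """P(YM, R, A) = P(YM) P(R | YM) P(A | YM); there is no R-A edge."""
    return P_YM[ym] * bern(P_R_GIVEN_YM[ym], r) * bern(P_A_GIVEN_YM[ym], a)


def prob(event, given):
    """P(event | given), computed by brute-force enumeration of the joint."""
    num = den = 0.0
    for ym, r, a in product([True, False], repeat=3):
        if given(ym, r, a):
            p = joint(ym, r, a)
            den += p
            if event(ym, r, a):
                num += p
    return num / den


def accident(ym, r, a):
    return a


# Marginally, red cars look more dangerous...
p_a_red = prob(accident, lambda ym, r, a: r)
p_a_not_red = prob(accident, lambda ym, r, a: not r)
# ...but once the driver is known to be a young man, colour carries no information.
p_a_red_ym = prob(accident, lambda ym, r, a: r and ym)
p_a_not_red_ym = prob(accident, lambda ym, r, a: (not r) and ym)

print(f"P(A | R)          = {p_a_red:.4f}")
print(f"P(A | not R)      = {p_a_not_red:.4f}")
print(f"P(A | R, YM)      = {p_a_red_ym:.4f}")
print(f"P(A | not R, YM)  = {p_a_not_red_ym:.4f}")
```

With these made-up numbers the first two probabilities differ, while the last two coincide, which is exactly the pattern "dependent, but conditionally independent given YM".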
A key property of the belief network scheme of constructing probability distributions is that they can be used to achieve a good balance between expressive power and model complexity. Consider the family of probability distributions over N variables that can take on K different values. Then the most naïve model is just an N-dimensional array of numbers P(X1, X2, … XN), requiring K^N parameters to specify. The number of parameters required for a belief network can be drastically lower, if many of the nodes are conditionally independent of one another. For example, if all but one node in the graph has exactly one parent, then the number of parameters required is basically NK^2 (N conditional distributions of the form P(Xchild|Xparent)). This is clearly a drastic improvement for reasonable values of K and N. Even though the model is much less complex, every variable is still related to every other variable—knowing the value of any Xi will change the probability of any Xj, unless some other intervening node is also known.
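To get a feel for the size of the gap, here is a small Python sketch that counts free parameters for the two cases just described. The function names are hypothetical, and the counts are exact free-parameter counts (each distribution loses one degree of freedom to normalization), so they come out slightly below the rough K^N and NK^2 figures.

```python
def full_joint_params(n, k):
    """Free parameters in an unrestricted joint table over n variables, k values each."""
    return k ** n - 1


def tree_network_params(n, k):
    """Free parameters when one root has no parent and every other node has one parent."""
    root = k - 1                      # P(X_root)
    children = (n - 1) * k * (k - 1)  # one P(X_child | X_parent) table per remaining node
    return root + children


for n, k in [(5, 2), (10, 4), (20, 10)]:
    print(f"N={n:2d} K={k:2d}  full joint: {full_joint_params(n, k):>22,d}  "
          f"tree network: {tree_network_params(n, k):,d}")
```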
Another attractive feature of the belief network paradigm is the existence of a fast inference algorithm for updating the probability distributions of some unobserved variables in response to evidence. For example, given that a patient has the feature “smoker=T” and “chest pain=T”, the inference algorithm can rapidly calculate the probability distribution of the unobserved variable “heart disease”. Unfortunately, there is a big catch: this inference algorithm is fast only for belief networks that are singly connected, meaning the graph has no loops even when the direction of the edges is ignored. (The directed graph of a belief network is always acyclic; it is the undirected loops that cause trouble.) If the network has such loops, the computational cost of exact inference is much larger: with the junction tree method it is exponential in the size of the largest clique of the triangulated graph.
In spite of these benefits, belief networks have some serious drawbacks as well. One major flaw is the difficulty of learning networks from data. Here the goal is to obtain a belief network specifying a distribution P(x) that assigns a very high likelihood to an empirical data set X = {X1, X2, ..., XM}, where now each X is an N-dimensional vector and there are M samples. The difficulty of learning belief networks stems from the fact that the number of possible graphs over N variables grows super-exponentially in N, so an exhaustive search over structures is out of the question for all but the smallest problems.
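How quickly the space of candidate structures grows can be made concrete by counting labeled directed acyclic graphs. The sketch below uses Robinson's recurrence, a standard combinatorial result that is not part of the argument above; the function name is hypothetical.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


for n in range(1, 11):
    print(f"N={n:2d}  DAGs: {count_dags(n):,d}   (2^N = {2 ** n:,d})")
```

Already at N = 5 there are 29,281 candidate graphs, and the count quickly outruns 2^N, which is why practical structure learning relies on heuristic search rather than enumeration.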
There is a second, more subtle flaw in the belief network paradigm. A belief network is defined based on a graph which has one node for each data variable. This is fine if you know what the correct variables are. But very often the correct variables are unknown; indeed, finding the correct variables is probably the key problem in learning. Once you know that the critical variables are mass, force, and acceleration, it is almost trivial to determine the relation between them. As an example, consider a system composed of four variables A, B, C, and X. The unknown relationship between these variables is X~F(A+B+C), where F is some complex distribution that depends on a single parameter. Now, a naïve representation will yield a belief network that looks like the following:
If we reencode the variables as A’=A, B’=A’+B, C’=B’+C, then we get the following graph:
If you count the number of links this might not seem like much of an improvement. But the number of links is not the important factor; the key factor is the required number of entries in the conditional probability tables. In the first graph, the CPT for X requires O(K^4) entries, where K is again the number of values A, B, and C can take on. But in the second graph the number of entries is O(K^2). So clearly the second graph should be preferred, but the basic belief network learning algorithms provide no way of obtaining it.
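The table sizes can be tallied directly. This Python sketch rests on several assumptions that the text does not spell out: that in the first graph A, B, and C are all parents of X; that in the second graph the re-encoded variables form a chain A’ → B’ → C’ → X; that A, B, and C take the integer values 0 to K-1; and that X also takes K values. The function names are hypothetical.

```python
def naive_cpt_entries(k):
    """Entries in P(X | A, B, C): one row per (a, b, c), one column per value of X."""
    return k ** 3 * k


def reencoded_cpt_entries(k):
    """Entries in each conditional table of the assumed chain A' -> B' -> C' -> X."""
    a_vals = k          # A' = A        takes values 0 .. k-1
    b_vals = 2 * k - 1  # B' = A' + B   takes values 0 .. 2(k-1)
    c_vals = 3 * k - 2  # C' = B' + C   takes values 0 .. 3(k-1)
    x_vals = k          # assumption: X has k possible values
    return {
        "P(B'|A')": a_vals * b_vals,
        "P(C'|B')": b_vals * c_vals,
        "P(X|C')": c_vals * x_vals,
    }


for k in (2, 10, 100):
    tables = reencoded_cpt_entries(k)
    print(f"K={k:3d}  naive P(X|A,B,C): {naive_cpt_entries(k):>11,d}  "
          f"re-encoded total: {sum(tables.values()):,d}  {tables}")
```

Under these assumptions every table in the re-encoded network has a number of entries proportional to K^2, while the single table for X in the naïve network grows as K^4.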
On the basis of these remarks I submit the following qualified statement: while the belief network paradigm is mathematically elegant and intuitively appealing, it is NOT very useful for describing real data.
The challenge is to prove the above claim wrong. This can be done as follows. First, find a real data set (see below for definition of the word “real”). Construct a belief network model of the data, in any way you choose. Then post a link to the data set, and I will then attempt to model it using alternative methods of my own choosing (probably Maximum Entropy or a variant thereof). We will then compare the likelihoods achieved by the two methods; higher likelihood wins. If there is ambiguity concerning the validity of the result, then we will compress the data set using compression algorithms based on the models and compare compression rates. Constructing a compressor from a statistical model is essentially a technical exercise; I can provide a Java implementation of arithmetic encoding. The compression rates must take into account the size of the compressor itself.
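The link between likelihood and compression is that a model assigning probability P(x) to a sample lets an ideal coder spend about -log2 P(x) bits on it, and an arithmetic coder comes close to that ideal. The sketch below is not the Java implementation mentioned above; it is a hypothetical Python illustration on toy data, and the compressor sizes passed in are made-up placeholders.

```python
import math
from collections import Counter


def code_length_bits(data, model):
    """Ideal code length of the data under the model: the sum of -log2 P(x)."""
    return sum(-math.log2(model[x]) for x in data)


def total_cost_bits(data, model, compressor_size_bits):
    """Score used for comparison: size of the compressor plus the encoded data."""
    return compressor_size_bits + code_length_bits(data, model)


# Toy data set with a skewed symbol distribution (synthetic, for illustration only).
data = list("aaaaaaabbbcc" * 100)

# Model 1: uniform over the three symbols.
uniform = {s: 1 / 3 for s in "abc"}
# Model 2: symbol frequencies fitted to the data.
counts = Counter(data)
fitted = {s: c / len(data) for s, c in counts.items()}

# Hypothetical compressor sizes in bits; the fitted model must also ship its parameters.
print("uniform:", round(total_cost_bits(data, uniform, compressor_size_bits=8_000)))
print("fitted: ", round(total_cost_bits(data, fitted, compressor_size_bits=8_064)))
```

Charging for the size of the compressor is what keeps a model from winning simply by memorizing the data set: every parameter it stores has to be paid for in bits.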
The challenge hinges on the meaning of the phrase “real data”. Obviously it is trivial to construct a data set for which a belief network is the best possible model, simply by building a network and then sampling from it. So my requirement is that the data set be non-synthetic. Other than that, there are no limitations: it can be image data, text, speech, machine learning sets, Netflix, social science databases, etc.
To make things interesting, the loser of the challenge will donate $100 (more?) to SIAI. Hopefully we can agree on the challenge (but not necessarily resolve it) before the Feb. 28th deadline for matching donations. In principle I will accept challenges until I lose so many that my wallet starts to hurt.
Challenge #2 - Compress the GSS
The General Social Survey is a widely used data set in the social sciences. Most analyses based on the GSS tend to use standard statistical tools such as correlation, analysis of variance, and so on. These kinds of analyses run into the usual problems associated with statistics: how do you choose a prior, how do you avoid overfitting, and so on.
I propose a new way of analyzing the data in the GSS—a compression-based challenge as outlined above. To participate in the challenge, you build a model of the data contained in the GSS using whatever methods appeal to you. Then you connect the model to an encoding method and compress the dataset. Whoever achieves the best compression rate, taking into account the size of the compressor itself, is the winner.
The GSS contains data about a great variety of economic, cultural, psychological, and educational factors. If you are a social scientist with a theory of how these factors relate, you can prove your theory by transforming it into a statistical model and then into a compression program, and demonstrating better compression results than rival theories.
If people are interested, I propose the following scheme. There is a $100 entry fee, of which half will go to SIAI. The other half goes to the winner. Again, hopefully we can agree on the challenge before the Feb. 28th deadline.