There is a lot more to say about the perspective that isn’t relaxed to continuous random variables. In particular, the problem of finding the maximum-entropy joint distribution that agrees with given pairwise distributions is closely related to Markov Random Fields and the Ising model. (The relaxation to continuous random variables is a Gaussian Markov Random Field.) It is easily seen that this maximum-entropy joint distribution must have the form

$$\log \Pr(\mathbb{1}_{\varphi_1}, \dots, \mathbb{1}_{\varphi_n}) = \sum_{i<j} \theta_{ij} \mathbb{1}_{\varphi_i \wedge \varphi_j} + \sum_i \theta_i \mathbb{1}_{\varphi_i} - \log Z,$$

where $\log Z$ is the normalizing constant, or partition function. This is an appealing distribution to use, and it is easy to condition on and to add new variables to. Computing relative entropy reduces to finding bivariate marginals and to computing $Z$, and computing marginals reduces to computing $Z$, which is intractable in general[^istrail], though easy if the Markov graph (ie the graph with edges $ij$ for $\theta_{ij} \neq 0$) is a forest.
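As a concrete (if naive) baseline, here is a minimal sketch (function names are mine) that evaluates $Z$ and a marginal of the pairwise form above by explicit enumeration over the indicator values $x_i \in \{0,1\}$; the exponential cost in $n$ is exactly what the structural tricks below try to avoid:

```python
import itertools
import math

def log_potential(x, theta_pair, theta_unary):
    """Unnormalized log-probability of one truth assignment x in {0,1}^n."""
    n = len(x)
    s = sum(theta_unary[i] * x[i] for i in range(n))
    s += sum(theta_pair.get((i, j), 0.0) * x[i] * x[j]
             for i in range(n) for j in range(i + 1, n))
    return s

def partition_function(n, theta_pair, theta_unary):
    """Z by explicit enumeration -- O(2^n), fine only for small n."""
    return sum(math.exp(log_potential(x, theta_pair, theta_unary))
               for x in itertools.product([0, 1], repeat=n))

def marginal(i, n, theta_pair, theta_unary):
    """Pr(phi_i) as a ratio of two partition-function-style sums."""
    num = sum(math.exp(log_potential(x, theta_pair, theta_unary))
              for x in itertools.product([0, 1], repeat=n) if x[i] == 1)
    return num / partition_function(n, theta_pair, theta_unary)
```

With all $\theta = 0$ this recovers the uniform distribution ($Z = 2^n$, all marginals $1/2$), which is a quick sanity check.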
There have been many approaches to this problem (Wainwright and Jordan[^wainwright] is a good survey), but the main ways to extend applicability beyond forests have been:
- decompose components of the graph as “junction trees”, ie trees whose nodes are overlapping clusters of nodes from the original graph; this permits exact computation with cost exponential in the cluster sizes, ie in the treewidth[^pearl]
- make use of clever combinatorial work coming out of statistical mechanics to do exact computation on “outerplanar” graphs, or on general graphs with cost exponential in the (outer-)graph genus[^schraudolph]
- find nodes such that conditioning on those nodes greatly simplifies the graph (eg makes it singly connected), and sum over their possible values explicitly (this has cost exponential in the number of nodes being conditioned on)
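The last strategy (cutset conditioning) can be sketched concretely: on a single cycle, conditioning on one node leaves a chain that can be summed by dynamic programming, and the total cost is the chain cost times $2^{(\text{number of conditioned nodes})}$. A toy version for binary variables (names and parameterization are mine), checked against brute force:

```python
import itertools
import math

def cycle_logpot(x, th_edge, th_node):
    """Unnormalized log-weight of assignment x on an n-cycle:
    edge i joins node i to node (i+1) mod n with coupling th_edge[i]."""
    n = len(x)
    s = sum(th_node[i] * x[i] for i in range(n))
    s += sum(th_edge[i] * x[i] * x[(i + 1) % n] for i in range(n))
    return s

def z_brute(n, th_edge, th_node):
    """Partition function by enumeration, O(2^n)."""
    return sum(math.exp(cycle_logpot(x, th_edge, th_node))
               for x in itertools.product([0, 1], repeat=n))

def z_cutset(n, th_edge, th_node):
    """Condition on node 0 (breaking the cycle into a chain), sum the chain
    by dynamic programming, then sum over node 0's two values.
    O(n) work per conditioned value; assumes n >= 3."""
    total = 0.0
    for x0 in (0, 1):
        # f[x1] accumulates node 0's unary weight, edge (0,1), node 1's unary
        f = {x1: math.exp(th_node[0] * x0 + th_edge[0] * x0 * x1
                          + th_node[1] * x1)
             for x1 in (0, 1)}
        for i in range(2, n):
            f = {xi: sum(f[xp] * math.exp(th_edge[i - 1] * xp * xi
                                          + th_node[i] * xi)
                         for xp in (0, 1))
                 for xi in (0, 1)}
        # close the cycle with the edge from node n-1 back to node 0
        total += sum(f[xl] * math.exp(th_edge[n - 1] * xl * x0)
                     for xl in (0, 1))
    return total
```

The same idea extends to any graph: pick a cutset whose removal leaves a forest, run the forest computation once per joint assignment of the cutset, and sum.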
A newer class of models, called sum-product networks[^poon], generalizes these and other models by writing the total joint probability as a positive polynomial

$$1 = \sum_{x_1, \dots, x_n = 0}^{1} \Pr(\mathbb{1}_{\varphi_1} = x_1, \dots, \mathbb{1}_{\varphi_n} = x_n)\, \varphi_1^{x_1} \bar{\varphi}_1^{1 - x_1} \cdots \varphi_n^{x_n} \bar{\varphi}_n^{1 - x_n}$$

in the variables $\varphi_1, \bar{\varphi}_1, \dots, \varphi_n, \bar{\varphi}_n$, and requiring only that this polynomial be simplifiable to an expression requiring a tractable number of additions and multiplications to evaluate. This allows easy computation of marginals, conditionals, and KL divergence, though it will likely be necessary to do some approximate simplification every so often (otherwise the complexity may accumulate, even with a fixed maximum number of sentences being considered at a time).
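As a toy illustration of evaluating such a network polynomial (the structure and weights below are invented for the example): a sum node mixes two product nodes whose leaves are weighted sums of the indicators. Marginals come from setting both indicators of a summed-out sentence to 1, and conditionals from a ratio of two evaluations:

```python
def spn(phi1, nphi1, phi2, nphi2):
    """A toy two-sentence sum-product network: a sum (mixture) node over two
    product nodes, whose leaves are weighted sums of indicator variables."""
    b1a = 0.9 * phi1 + 0.1 * nphi1   # leaf distributions for sentence 1
    b1b = 0.2 * phi1 + 0.8 * nphi1
    b2a = 0.8 * phi2 + 0.2 * nphi2   # leaf distributions for sentence 2
    b2b = 0.3 * phi2 + 0.7 * nphi2
    return 0.6 * (b1a * b2a) + 0.4 * (b1b * b2b)

# joint probability: set each indicator to the chosen assignment
p_joint = spn(1, 0, 1, 0)       # Pr(phi1, phi2) ~ 0.456
# marginal: set both indicators of the summed-out sentence to 1
p_phi1 = spn(1, 0, 1, 1)        # Pr(phi1) ~ 0.62
# conditional as a ratio of two evaluations
p_cond = p_joint / p_phi1       # Pr(phi2 | phi1)
# the network polynomial sums to 1 with all indicators set to 1
assert abs(spn(1, 1, 1, 1) - 1.0) < 1e-12
```

Every query here costs one or two passes over the network, which is the point: tractability is a property of the circuit size, not of any underlying graph.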
However, if we want to stay close to the context of the Non-Omniscience paper, we can do approximate calculations of the partition function on the complete graph. In particular, the Bethe partition function[^weller] has been widely used in practice, and while it is not log-convex the way $Z$ is, it is often a better approximation to the partition function than well-known convex approximations such as the tree-reweighted (TRW) approximation.
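One way the Bethe partition function is computed in practice is by running loopy belief propagation and plugging the resulting beliefs into the Bethe free energy; the sketch below (function names mine) does this for a binary pairwise model. On a tree the result coincides with the exact $\log Z$, which gives a handy sanity check; on loopy graphs it is only an approximation.

```python
import math

def bethe_log_z(n, edges, theta_node, theta_edge, iters=200):
    """Loopy belief propagation on a binary pairwise model, then the Bethe
    free energy F of the resulting beliefs; returns log Z_Bethe = -F.
    theta_edge maps each edge (i, j) with i < j to its coupling theta_ij."""
    nbrs = {i: [] for i in range(n)}
    for (i, j) in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def psi_n(i, x):
        return math.exp(theta_node[i] * x)

    def psi_e(i, j, xi, xj):
        return math.exp(theta_edge[(min(i, j), max(i, j))] * xi * xj)

    # normalized messages m_{i->j}(x_j), synchronous updates
    msg = {(i, j): [0.5, 0.5] for (a, b) in edges for (i, j) in ((a, b), (b, a))}
    for _ in range(iters):
        new = {}
        for (i, j) in msg:
            vals = []
            for xj in (0, 1):
                s = 0.0
                for xi in (0, 1):
                    prod = psi_n(i, xi) * psi_e(i, j, xi, xj)
                    for k in nbrs[i]:
                        if k != j:
                            prod *= msg[(k, i)][xi]
                    s += prod
                vals.append(s)
            z = vals[0] + vals[1]
            new[(i, j)] = [v / z for v in vals]
        msg = new

    # node beliefs b_i
    b = {}
    for i in range(n):
        bi = [psi_n(i, x) * math.prod(msg[(k, i)][x] for k in nbrs[i])
              for x in (0, 1)]
        z = bi[0] + bi[1]
        b[i] = [v / z for v in bi]

    # Bethe free energy: edge terms plus degree-corrected node terms
    F = 0.0
    for (i, j) in edges:
        bij = {}
        for xi in (0, 1):
            for xj in (0, 1):
                v = psi_n(i, xi) * psi_n(j, xj) * psi_e(i, j, xi, xj)
                v *= math.prod(msg[(k, i)][xi] for k in nbrs[i] if k != j)
                v *= math.prod(msg[(l, j)][xj] for l in nbrs[j] if l != i)
                bij[(xi, xj)] = v
        z = sum(bij.values())
        for (xi, xj), v in bij.items():
            p = v / z
            if p > 0:
                F += p * (math.log(p) - math.log(psi_e(i, j, xi, xj)))
    for i in range(n):
        d = len(nbrs[i])
        for x in (0, 1):
            p = b[i][x]
            if p > 0:
                F += (1 - d) * p * math.log(p) - p * math.log(psi_n(i, x))
    return -F
```

On graphs with cycles, convergence of the updates is not guaranteed, and damping or convergent double-loop alternatives are commonly used instead.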
[^istrail]: Istrail, Sorin. “Statistical mechanics, three-dimensionality and NP-completeness: I. Universality of intractability for the partition function of the Ising model across non-planar surfaces.” In Proceedings of the thirty-second annual ACM symposium on Theory of computing, pp. 87-96. ACM, 2000.
[^weller]: Weller, Adrian. “Bethe and Related Pairwise Entropy Approximations.”
[^pearl]: Pearl, Judea. “Probabilistic reasoning in intelligent systems: Networks of plausible inference.” (1988).
[^schraudolph]: Schraudolph, Nicol N., and Dmitry Kamenetsky. “Efficient Exact Inference in Planar Ising Models.” arXiv preprint arXiv:0810.4401 (2008).
[^wainwright]: Wainwright, Martin J., and Michael I. Jordan. “Graphical models, exponential families, and variational inference.” Foundations and Trends® in Machine Learning 1, no. 1-2 (2008): 1-305.
[^poon]: Poon, Hoifung, and Pedro Domingos. “Sum-product networks: A new deep architecture.” In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pp. 689-690. IEEE, 2011.