Say I am interested in distinguishing between two hypotheses for p(a,b,c,d) (otherwise unrestricted):
hypothesis 1: “A is independent of B, C is independent of D, and nothing else is true”
hypothesis 2: “no independences hold”
Frequentists can run their non-parametric marginal independence tests. What is the (a?) Bayesian procedure here? As far as I can tell, for unrestricted densities p(a,b,c,d) no one knows how to write down the likelihood for H1. You can do a standard Bayesian setup here in some cases, e.g. if p(a,b,c,d) is multivariate normal, in which case H1 corresponds to a (simple) Gaussian ancestral graph model. Maybe one can do some non-parametric Bayes thing (???). It’s not so simple to set up the right model sometimes, which is what Bayesian methods generally need.
You should check out chapter 20 of Jaynes’ Probability Theory, which talks about Bayesian model comparison.
We wish to calculate P[H1 | data] / P[H2 | data] = P[data | H1] / P[data | H2] * P[H1] / P[H2].
For Bayesians, this problem does not involve “unrestricted densities” at all. We are given some data and presumably we know the space from which it was drawn (e.g. binary, categorical, reals...). That alone specifies a unique model distribution. For discrete data, symmetry arguments mandate a Dirichlet model prior with the categories given by all possible outcomes of {A,B,C,D}. For H2, the Dirichlet parameters are updated in the usual fashion and P[data | H2] calculated accordingly.
For H1, our Dirichlet prior is further restricted according to the independences. The resulting distribution is not elegant (as far as I can tell), but it does exist and can be updated. For example, if the variables are all binary, then the Dirichlet for H2 has 16 categories. We’ll call the 16 frequencies X0000, X0001, X0010, … with parameters a0000, a0001, … where the XABCD are the probabilities which the model given by X assigns to each outcome. Already, the Dirichlet for H2 is constrained to {X | sum(X) = 1, X > 0} within R^16. The Dirichlet for H1 is exactly the same function, but further constrained to the space {X | sum(X) = 1, X > 0, X00.. / X10.. = X01.. / X11.., X..00 / X..10 = X..01 / X..11} within R^16, where a dot in a subscript means summing over that variable (so X00.. is the marginal probability that A=0 and B=0). This is probably painful to work with (analytically at the very least), but is fine in principle.
So we have P[data | H1] and P[data | H2]. That just leaves the prior probabilities for each model. At first glance, it might seem that H1 has zero prior, since it corresponds to a measure-zero subset of H2. But really, we must have SOME prior information lending H1 a nonzero prior probability or we wouldn’t bother comparing the two in the first place. Beyond that, we’d have to come up with reasonable probabilities based on whatever prior information we have. Given no other information besides the fact that we’re comparing the two, it would be 50⁄50.
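To make this concrete, here is a minimal sketch of the whole calculation for four binary variables, assuming a uniform Dirichlet(1,…,1) prior over the 16 cells. P[data | H2] is the usual Dirichlet–multinomial evidence in closed form; for P[data | H1] the measure-zero constraint surface is approximated by a soft epsilon tolerance and plain Monte Carlo, which is crude but shows the shape of the computation (everything here, including the counts, is made up for illustration):

```python
# Sketch only: uniform Dirichlet(1,...,1) prior over the 16 joint probabilities,
# and an epsilon-thickened version of the two independence constraints so that
# H1's measure-zero surface can be handled by naive Monte Carlo.
import numpy as np
from scipy.special import gammaln, logsumexp

rng = np.random.default_rng(0)

# Hypothetical data: counts n[a, b, c, d] for the 16 outcomes.
n = rng.integers(0, 20, size=(2, 2, 2, 2)).astype(float)

def log_evidence_h2(counts):
    """Dirichlet(1)-multinomial evidence, available in closed form."""
    a = np.ones_like(counts)
    return (gammaln(a.sum()) - gammaln(a.sum() + counts.sum())
            + np.sum(gammaln(a + counts) - gammaln(a)))

def log_evidence_h1(counts, n_samples=100_000, eps=0.01):
    """Monte Carlo over the prior, softly enforcing the two independence constraints."""
    X = rng.dirichlet(np.ones(16), size=n_samples).reshape(n_samples, 2, 2, 2, 2)
    pab = X.sum(axis=(3, 4))                                        # marginal of (A, B)
    pcd = X.sum(axis=(1, 2))                                        # marginal of (C, D)
    r1 = pab[:, 0, 0] - pab[:, 0, :].sum(1) * pab[:, :, 0].sum(1)   # A-B residual
    r2 = pcd[:, 0, 0] - pcd[:, 0, :].sum(1) * pcd[:, :, 0].sum(1)   # C-D residual
    log_w = -0.5 * (r1 ** 2 + r2 ** 2) / eps ** 2                   # epsilon-slab weight
    log_lik = (counts * np.log(np.clip(X, 1e-300, None))).sum(axis=(1, 2, 3, 4))
    return logsumexp(log_w + log_lik) - logsumexp(log_w)            # E[lik * w] / E[w]

log_bf = log_evidence_h1(n) - log_evidence_h2(n)
print("log Bayes factor, H1 over H2:", log_bf)
# With 50/50 prior odds, the posterior odds are just this Bayes factor.
```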
Of course this is all completely unscalable. Fortunately, we can throw away information to save computation. More specifically, we can discretize and bin things much like we would for simple marginal independence tests. While it won’t yield the ideal Bayesian result, it is still the ideal result given only the binned data.
I am a bit curious about the non-parametric tests used for H1. I am familiar with tests for whether A and B are independent, and of course they can be applied between C and D, but how does one test for independence between both pairs simultaneously without assuming that the events (A independent of B) and (C independent of D) are independent? It is precisely this difficulty which makes the Bayesian likelihood calculation of H1 such a mess, and I am curious how frequentist methods approach it.
My apologies for the truly awful typesetting, but this is not the evening on which I learn to integrate tex in lesswrong posts.
The resulting distribution is not elegant (as far as I can tell).
Thanks for this post.
In the binary case, the saturated model can be parameterized by p(S = 0) for S any non-empty subset of { a,b,c,d }. The submodel corresponding to H1 is just one where p({a,b} = 0) = p({a}=0)p({b}=0), and p({c,d} = 0) = p({c}=0)p({d}=0).
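For concreteness, a small check of this with binary variables (the numbers are made up): the q(S) = p(S = 0) values determine the whole joint by inclusion–exclusion, and the single constraint p({a,b} = 0) = p({a}=0)p({b}=0) holds exactly when A and B are marginally independent.

```python
# Small check, for binary variables: the q(S) = p(S = 0) values parameterize the
# saturated model, and the displayed constraint is exactly A independent of B.
import itertools
import numpy as np

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(16)).reshape(2, 2, 2, 2)    # a random joint over (A,B,C,D)

def q(S, joint):
    """P(X_i = 0 for every i in S)."""
    idx = [slice(None)] * 4
    for i in S:
        idx[i] = 0
    return joint[tuple(idx)].sum()

def joint_from_q(x, joint):
    """Recover P(X = x) from the q's by inclusion-exclusion."""
    zeros = tuple(i for i in range(4) if x[i] == 0)
    ones = [i for i in range(4) if x[i] == 1]
    return sum((-1) ** len(T) * q(zeros + T, joint)
               for r in range(len(ones) + 1) for T in itertools.combinations(ones, r))

recovered = [joint_from_q(x, p) for x in itertools.product((0, 1), repeat=4)]
assert np.allclose(recovered, p.reshape(-1))          # the q's pin down the whole joint

pab = p.sum(axis=(2, 3))                              # marginal of (A, B)
constraint = np.isclose(q((0, 1), p), q((0,), p) * q((1,), p))
independent = np.allclose(pab, np.outer(pab.sum(1), pab.sum(0)))
print(constraint, independent)   # equal: both False for a generic random joint,
                                 # both True if the (A,B) marginal is a product table.
```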
For Bayesians, this problem does not involve “unrestricted densities” at all.
I am sorry, Bayesians do not get to decide what my problem is. My problem involves unrestricted densities by definition. I don’t think you get to keep your “fully general formalism” chops if you suddenly start redefining my problem for me.
how does one test for independence between both pairs simultaneously without assuming that the events (A independent of B) and (C independent of D) are independent?
This is a good question. I don’t know a good answer to this that does not involve dealing with the likelihood in some way.
Sorry, I didn’t mean to be dismissive of the general densities requirement. I mean that data always comes with a space, and that restricts the density. We could consider our densities completely general to begin with, but as soon as you give me data to test, I’m going to look at it and say “Ok, this is binary?” or “Ok, these are positive reals?” or something. The space gives the prior model. Without that information, there is no Bayesian answer.
I guess you could say that this isn’t fully general because we don’t have a unique prior for every possible space, which is a very valid point. For the spaces people actually deal with we have priors, and Jaynes would probably argue that any space of practical importance can be constructed as the limit of some discrete space. I’d say it’s not completely general, because we don’t have good ways of deriving the priors when symmetry and maximum entropy are insufficient. The Bayesian formalism will also fail in cases where the priors are non-normalizable, which is basically the formalism saying “Not enough information.”
On the other hand, I would be very surprised to see any other method which works in cases where the Bayesian formalism does not yield an answer. I would expect such methods to rely on additional information which would yield a proper prior.
Regarding that ugly distribution, that parameterization is basically where the constraints came from. Remember that the Dirichlets are distributions on the p’s themselves, so it’s an hierarchical model. So yes, it’s not hard to write down the subspace corresponding to that submodel, but actually doing an update on the meta-level distribution over that subspace is painful.
I mean that data always comes with a space, and that restricts the density.
Sorry, I am confused. Say A,B,C,D are in the [0,1] segment of the real line. This doesn’t really restrict anything.
For the spaces people actually deal with we have priors.
I deal with this space. I even have a paper in preparation that deals with this space! So do lots of people that worry about learning graphs from data.
On the other hand, I would be very surprised to see any other method which works in cases where the Bayesian formalism does not yield an answer.
People use variations of the FCI algorithm, which from a Bayesian point of view is a bit of a hack. The asymptopia version of FCI assumes a conditional independence oracle, and then tells you what the model is based on what the oracle says. In practice, rather than using an oracle, people do a bunch of hypothesis tests for independence.
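As a rough sketch of that last step (not how any particular package actually does it): for discrete data the oracle gets replaced by something like a stratified chi-square test, and the search consults this function wherever the idealized algorithm would consult the oracle.

```python
# Toy conditional-independence "oracle" for integer-coded discrete data, built from
# stratified chi-square tests. Real FCI/PC implementations are far more careful.
import numpy as np
from scipy.stats import chi2, chi2_contingency

def ci_test(data, x, y, z=(), alpha=0.05):
    """True = 'fail to reject X independent of Y given Z' on columns x, y, z."""
    data, z = np.asarray(data), list(z)
    stat, dof = 0.0, 0
    strata = [()] if not z else list({tuple(r) for r in data[:, z]})
    for s in strata:
        rows = data if not z else data[(data[:, z] == s).all(axis=1)]
        table = np.zeros((data[:, x].max() + 1, data[:, y].max() + 1))
        np.add.at(table, (rows[:, x], rows[:, y]), 1)
        keep_r, keep_c = table.sum(1) > 0, table.sum(0) > 0
        if keep_r.sum() > 1 and keep_c.sum() > 1:
            s_stat, _, s_dof, _ = chi2_contingency(table[keep_r][:, keep_c])
            stat, dof = stat + s_stat, dof + s_dof
    return dof == 0 or chi2.sf(stat, dof) > alpha

# e.g. ci_test(samples, 0, 1) for a marginal test, ci_test(samples, 0, 2, z=[1, 3])
# for a conditional one; the constraint-based search then branches on these answers.
```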
Regarding that ugly distribution
You are being so mean to that poor distribution. You know, H1 forms a curved exponential family if A,B,C,D are discrete. That’s sort of the opposite of ugly. I think it’s beautiful! H1 is an instance of Thomas Richardson’s ancestral graph models, with the graph:
A <-> C <-> B <-> D <-> A
Oh, saying A,B,C,D are in [0,1] restricts quite a bit. It eliminates distributions with support over all the reals, distributions over R^n, distributions over words starting with the letter k, distributions over Turing machines, distributions over elm trees more than 4 years old in New Hampshire, distributions over bizarre mathematical objects that I can’t even think of… That’s a LOT of prior information. It’s a continuous space, so we can’t apply a maximum entropy argument directly to find our prior. Typically we use the beta prior for [0,1] due to a symmetry argument, but that admittedly is not appropriate in all cases. On the other hand, unless you can find dependencies after running the data through the continuous equivalent of a pseudo-random number generator, you are definitely utilizing SOME additional prior information (e.g. via smoothness assumptions). When the Bayesian formalism does not yield an answer, it’s usually because we don’t have enough prior info to rule out stuff like that.
I think we’re still talking past each other about the distributions. The Bayesian approach to this problem uses an hierarchical distribution with two levels: one specifying the distribution p[A,B,C,D | X] in terms of some parameter vector X, and the other specifying the distribution p[X]. Perhaps the notation p[A,B,C,D ; X] is more familiar? Anyway, the hypothesis H1 corresponds to a subset of possible values of X. The beautiful distribution you talk about is p[A,B,C,D | X], which can indeed be written quite elegantly as an exponential family distribution with features for each clique in the graph. Under that parameterization, X would be the lambda vector specifying the exponential model. Unfortunately, p[X] is the ugly one, and that elegant parameterization for p[A,B,C,D | X] will probably make p[X] even uglier.
It is much prettier for DAGs. In that case, we’d have one beta distribution for every possible set of inputs to each variable. X would then be the set of parameters for all those beta distributions. We’d get elegant generative models for numerical integration and life would be sunny and warm. So the simple use case for FCI is amenable to Bayesian methods. Latent variables are still a pain, though. They’re fine in theory, just integrate over them when calculating the posterior, but it gets ugly fast.
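To spell out the sunny DAG case, here is a sketch assuming binary variables and uniform Beta(1,1) priors for each parent configuration (my choice purely for illustration): the evidence factors into one beta–binomial term per (node, parent configuration), so scoring a candidate DAG is a closed-form sum.

```python
# Sketch: closed-form log marginal likelihood of binary data under a fixed DAG,
# with an independent Beta(1,1) prior on P(node = 1 | parent configuration).
import numpy as np
from scipy.special import betaln

def log_marginal_likelihood(data, parents):
    """data: (n_samples, n_vars) 0/1 array; parents: dict node -> list of parents."""
    data = np.asarray(data)
    total = 0.0
    for node, pa in parents.items():
        configs = data[:, list(pa)]
        for cfg in {tuple(row) for row in configs}:
            rows = data[(configs == cfg).all(axis=1)] if pa else data
            n1 = rows[:, node].sum()
            n0 = len(rows) - n1
            total += betaln(1 + n1, 1 + n0) - betaln(1, 1)   # beta-binomial evidence
    return total

# Hypothetical usage: score the DAG  A -> C <- B,  with D isolated (columns A,B,C,D).
rng = np.random.default_rng(0)
samples = rng.integers(0, 2, size=(500, 4))
print(log_marginal_likelihood(samples, {0: [], 1: [], 2: [0, 1], 3: []}))
```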
Oh, saying A,B,C,D are in [0,1] restricts quite a bit. It eliminates distributions with support over all the reals
???
There are easy to compute bijections from R to [0,1], etc.
The Bayesian approach to this problem uses an hierarchical distribution with two levels: one specifying the distribution p[A,B,C,D | X] in terms of some parameter vector X, and the other specifying the distribution p[X]
Yes, parametric Bayes does this. I am giving you a problem where you can’t write down p(A,B,C,D | X) explicitly and then asking you to solve something frequentists are quite happy solving. Yes I am aware I can do a prior for this in the discrete case. I am sure a paper will come of it eventually.
Latent variables are still a pain, though.
The whole point of things like the beautiful distribution is that you don’t have to deal with latent variables. By the way, the reason to think about H1 is that it represents all independences over A,B,C,D in this latent variable DAG:
A ← u1 → C ← u2 → B ← u3 → D ← u4 → A
where we marginalize out the ui variables.
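A quick numerical illustration of that marginalization, using throwaway linear-Gaussian structural equations: A and B share no latent parent, and neither do C and D, so exactly those two pairs come out independent while every adjacent pair stays dependent.

```python
# Simulate the latent DAG with arbitrary linear-Gaussian equations, then look only
# at the marginal behaviour of A, B, C, D.
import numpy as np

rng = np.random.default_rng(0)
u1, u2, u3, u4 = rng.standard_normal((4, 200_000))

A = u1 + u4 + 0.1 * rng.standard_normal(200_000)
C = u1 + u2 + 0.1 * rng.standard_normal(200_000)
B = u2 + u3 + 0.1 * rng.standard_normal(200_000)
D = u3 + u4 + 0.1 * rng.standard_normal(200_000)

print(np.round(np.corrcoef([A, B, C, D]), 2))
# The (A,B) and (C,D) correlations are ~0 (and everything is jointly Gaussian here,
# so uncorrelated means independent), while (A,C), (A,D), (B,C), (B,D) are ~0.5.
```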
which can indeed be written quite elegantly as an exponential family distribution with features for each clique in the graph
I think you might be confusing undirected and bidirected graph models. The former form linear exponential families and can be parameterized via cliques, the latter form curved exponential families, and can be parameterized via connected sets.
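To make that distinction concrete for the four-cycle above: the undirected version has one clique per vertex and per edge (8 in all), while the bidirected version is parameterized by the 13 connected vertex sets, which matches the 15 − 2 = 13 dimensions left after imposing the two independence constraints.

```python
# Count cliques vs connected sets of the four-cycle with edges {A,C},{C,B},{B,D},{D,A}.
from itertools import combinations

nodes = ["A", "B", "C", "D"]
edges = {frozenset(e) for e in [("A", "C"), ("C", "B"), ("B", "D"), ("D", "A")]}

def adjacent(x, y):
    return frozenset((x, y)) in edges

def is_clique(s):
    return all(adjacent(x, y) for x, y in combinations(s, 2))

def is_connected(s):
    s, comp, frontier = set(s), set(), {sorted(s)[0]}
    while frontier:
        v = frontier.pop()
        comp.add(v)
        frontier |= {w for w in s - comp if adjacent(v, w)}
    return comp == s

subsets = [c for r in range(1, 5) for c in combinations(nodes, r)]
print("cliques:       ", [s for s in subsets if is_clique(s)])     # 8 of them
print("connected sets:", [s for s in subsets if is_connected(s)])  # 13 of them
```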
There are easy to compute bijections from R to [0,1], etc.
This is not true: there are bijections between R and (0,1), but not the closed interval. Anyway, there are more striking examples; for example, if you know that A, B, C, D are in a discrete finite set, it restricts your choices quite a lot.
No.
Did you mean to say continuous bijections? Obviously adding two points wouldn’t change the cardinality of an infinite set, but “easy to compute” might change.
You’re right, I meant continuous bijections, as the context was a transformation of a probability distribution.
You are right, apologies.