The resulting distribution is not elegant (as far as I can tell).
In the binary case, the saturated model can be parameterized by p(S = 0) for S any non-empty subset of { a,b,c,d }. The submodel corresponding to H1 is just one where p({a,b} = 0) = p({a}=0)p({b}=0), and p({c,d} = 0) = p({c}=0)p({d}=0).
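To make that parameterization concrete, here is a small sketch (numpy, with made-up marginals; the fully independent joint is just for illustration) that reads off p(S = 0) from a joint table and checks the two H1 constraints:

```python
import numpy as np

# Joint table p[a, b, c, d] over four binary variables, indexed 0/1.
# Built here as a product of marginals purely for illustration; any
# nonnegative table summing to 1 would do.
marg0 = {"a": 0.3, "b": 0.6, "c": 0.2, "d": 0.7}  # illustrative P(var = 0)
p = np.ones((2, 2, 2, 2))
for i, v in enumerate("abcd"):
    shape = [1, 1, 1, 1]
    shape[i] = 2
    p *= np.array([marg0[v], 1 - marg0[v]]).reshape(shape)

def p_S_zero(p, S):
    """P(all variables in S equal 0): sum the table over the complement of S."""
    idx = tuple(0 if v in S else slice(None) for v in "abcd")
    return p[idx].sum()

# The saturated model is determined by p_S_zero for every non-empty S.
# The two H1 constraints from the text:
lhs_ab = p_S_zero(p, {"a", "b"})
rhs_ab = p_S_zero(p, {"a"}) * p_S_zero(p, {"b"})
lhs_cd = p_S_zero(p, {"c", "d"})
rhs_cd = p_S_zero(p, {"c"}) * p_S_zero(p, {"d"})
print(np.isclose(lhs_ab, rhs_ab), np.isclose(lhs_cd, rhs_cd))
```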
For Bayesians, this problem does not involve “unrestricted densities” at all.
I am sorry, Bayesians do not get to decide what my problem is. My problem involves unrestricted densities by definition. I don’t think you get to keep your “fully general formalism” chops if you suddenly start redefining my problem for me.
how does one test for independence between both pairs simultaneously without assuming that the events
(A independent of B) and (C independent of D) are independent?
This is a good question. I don’t know a good answer to this that does not involve dealing with the likelihood in some way.
Sorry, I didn’t mean to be dismissive of the general densities requirement. I mean that data always comes with a space, and that restricts the density. We could consider our densities completely general to begin with, but as soon as you give me data to test, I’m going to look at it and say “Ok, this is binary?” or “Ok, these are positive reals?” or something. The space gives the prior model. Without that information, there is no Bayesian answer.
I guess you could say that this isn’t fully general because we don’t have a unique prior for every possible space, which is a very valid point. For the spaces people actually deal with we have priors, and Jaynes would probably argue that any space of practical importance can be constructed as the limit of some discrete space. I’d say it’s not completely general, because we don’t have good ways of deriving the priors when symmetry and maximum entropy are insufficient. The Bayesian formalism will also fail in cases where the priors are non-normalizable, which is basically the formalism saying “Not enough information.”
On the other hand, I would be very surprised to see any other method which works in cases where the Bayesian formalism does not yield an answer. I would expect such methods to rely on additional information which would yield a proper prior.
Regarding that ugly distribution, that parameterization is basically where the constraints came from. Remember that the Dirichlets are distributions on the p’s themselves, so it’s a hierarchical model. So yes, it’s not hard to write down the subspace corresponding to that submodel, but actually doing an update on the meta-level distribution over that subspace is painful.
I mean that data always comes with a space, and that restricts the density.
Sorry I am confused. Say A,B,C,D are in [0,1] segment of the real line. This doesn’t really restrict anything.
For the spaces people actually deal with we have priors.
I deal with this space. I even have a paper in preparation that deals with this space! So do lots of people that worry about learning graphs from data.
On the other hand, I would be very surprised to see any other method which works in cases where the
Bayesian formalism does not yield an answer.
People use variations of the FCI algorithm, which from a Bayesian point of view is a bit of a hack. The asymptopia version of FCI assumes a conditional independence oracle, and then tells you what the model is based on what the oracle says. In practice, rather than using an oracle, people do a bunch of hypothesis tests for independence.
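For flavor, here is a minimal sketch of the "bunch of hypothesis tests" step, using a hand-rolled Pearson chi-square test on fake binary data (the data-generating choices and the 0.05 cutoff are mine, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical binary data: A and B generated independently, D mostly copies C.
A = rng.integers(0, 2, n)
B = rng.integers(0, 2, n)
C = rng.integers(0, 2, n)
D = np.where(rng.random(n) < 0.9, C, 1 - C)

def chi2_indep(x, y):
    """Pearson chi-square statistic for independence of two binary variables."""
    table = np.zeros((2, 2))
    for xi, yi in zip(x, y):
        table[xi, yi] += 1
    expected = np.outer(table.sum(1), table.sum(0)) / table.sum()
    return ((table - expected) ** 2 / expected).sum()

CRIT = 3.841  # chi-square critical value, df = 1, alpha = 0.05
print(chi2_indep(A, B) > CRIT)  # usually False: no evidence against A indep B
print(chi2_indep(C, D) > CRIT)  # True here: C and D are strongly dependent
```

This is the practical stand-in for the oracle: each query "is X independent of Y?" becomes one such test, with all the multiple-testing baggage that implies.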
Regarding that ugly distribution
You are being so mean to that poor distribution. You know, H1 forms a curved exponential family if A,B,C,D are discrete. That’s sort of the opposite of ugly. I think it’s beautiful! H1 is an instance of Thomas Richardson’s ancestral graph models, with the graph:
A <-> B <-> C <-> D <-> A
Oh, saying A,B,C,D are in [0,1] restricts quite a bit. It eliminates distributions with support over all the reals, distributions over R^n, distributions over words starting with the letter k, distributions over Turing machines, distributions over elm trees more than 4 years old in New Hampshire, distributions over bizarre mathematical objects that I can’t even think of… That’s a LOT of prior information. It’s a continuous space, so we can’t apply a maximum entropy argument directly to find our prior. Typically we use the beta prior for [0,1] due to a symmetry argument, but that admittedly is not appropriate in all cases. On the other hand, unless you can find dependencies after running the data through the continuous equivalent of a pseudo-random number generator, you are definitely utilizing SOME additional prior information (e.g. via smoothness assumptions). When the Bayesian formalism does not yield an answer, it’s usually because we don’t have enough prior info to rule out stuff like that.
I think we’re still talking past each other about the distributions. The Bayesian approach to this problem uses a hierarchical distribution with two levels: one specifying the distribution p[A,B,C,D | X] in terms of some parameter vector X, and the other specifying the distribution p[X]. Perhaps the notation p[A,B,C,D ; X] is more familiar? Anyway, the hypothesis H1 corresponds to a subset of possible values of X. The beautiful distribution you talk about is p[A,B,C,D | X], which can indeed be written quite elegantly as an exponential family distribution with features for each clique in the graph. Under that parameterization, X would be the lambda vector specifying the exponential model. Unfortunately, p[X] is the ugly one, and that elegant parameterization for p[A,B,C,D | X] will probably make p[X] even uglier.
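To show the two-level structure concretely, here is a brute-force sketch for the discrete case (flat Dirichlet prior over the 16 joint probabilities, fake multinomial counts, all numbers illustrative) that estimates the posterior mass near the H1 surface by prior sampling weighted by the likelihood. The inefficiency of this is exactly the painfulness being described:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two-level model: X is the 16-vector of joint probabilities p[A,B,C,D | X],
# with a flat Dirichlet prior p[X].  Fake data for illustration:
counts = rng.multinomial(200, np.ones(16) / 16)

def h1_gap(x):
    """How far a joint table is from satisfying both H1 constraints."""
    p = x.reshape(2, 2, 2, 2)
    gap_ab = abs(p[0, 0].sum() - p[0].sum() * p[:, 0].sum())
    gap_cd = abs(p[..., 0, 0].sum() - p[..., 0, :].sum() * p[..., 0].sum())
    return max(gap_ab, gap_cd)

# Importance sampling from the prior, weighted by the multinomial likelihood.
X = rng.dirichlet(np.ones(16), size=20000)
loglik = (counts * np.log(X)).sum(axis=1)
w = np.exp(loglik - loglik.max())
near = np.array([h1_gap(x) < 0.01 for x in X])  # "near H1" tolerance is arbitrary
posterior_near_h1 = (w * near).sum() / w.sum()
print(0.0 <= posterior_near_h1 <= 1.0)
```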
It is much prettier for DAGs. In that case, we’d have one beta distribution for every possible set of inputs to each variable. X would then be the set of parameters for all those beta distributions. We’d get elegant generative models for numerical integration and life would be sunny and warm. So the simple use case for FCI is amenable to Bayesian methods. Latent variables are still a pain, though. They’re fine in theory, just integrate over them when calculating the posterior, but it gets ugly fast.
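Concretely, the conjugate bookkeeping for the DAG case looks like this; the chain A -> B -> C and all the numbers are hypothetical, just to show the shape of "one beta per variable per parent configuration":

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)

# Hypothetical DAG A -> B -> C over binary variables, with fake data in which
# each child mostly copies its parent.
parents = {"A": [], "B": ["A"], "C": ["B"]}
data = {"A": rng.integers(0, 2, 100)}
data["B"] = np.where(rng.random(100) < 0.8, data["A"], 1 - data["A"])
data["C"] = np.where(rng.random(100) < 0.8, data["B"], 1 - data["B"])

# One Beta(1, 1) prior per variable per parent configuration, updated in
# closed form: posterior is Beta(1 + successes, 1 + failures).
posterior = {}
for v, ps in parents.items():
    for cfg in product([0, 1], repeat=len(ps)):
        mask = np.ones(100, dtype=bool)
        for par, val in zip(ps, cfg):
            mask &= data[par] == val
        k, n = int(data[v][mask].sum()), int(mask.sum())
        posterior[(v, cfg)] = (1 + k, 1 + n - k)  # Beta(alpha, beta) params

print(posterior[("A", ())])  # Beta posterior for P(A = 1)
```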
Oh, saying A,B,C,D are in [0,1] restricts quite a bit. It eliminates distributions with support over
all the reals
???
There are easy to compute bijections from R to [0,1], etc.
The Bayesian approach to this problem uses an hierarchical distribution with two levels: one
specifying the distribution p[A,B,C,D | X] in terms of some parameter vector X, and the other
specifying the distribution p[X]
Yes, parametric Bayes does this. I am giving you a problem where you can’t write down p(A,B,C,D | X) explicitly and then asking you to solve something frequentists are quite happy solving. Yes I am aware I can do a prior for this in the discrete case. I am sure a paper will come of it eventually.
Latent variables are still a pain, though.
The whole point of things like the beautiful distribution is that you don’t have to deal with latent variables. By the way, the reason to think about H1 is that it represents all independences over A,B,C,D in this latent variable DAG:
A ← u1 → B ← u2 → C ← u3 → D ← u4 → A
where we marginalize out the ui variables.
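A quick simulation sketch (Gaussian latents and illustrative coefficients of my choosing, not from the thread) of marginalizing out the u_i: each observed variable is a noisy sum of its two latent parents, and after discarding the u's, pairs that share a latent come out dependent while A and C, which share none, come out uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Latents on the edges of the cycle: u1 -> A,B; u2 -> B,C; u3 -> C,D; u4 -> D,A.
u1, u2, u3, u4 = rng.normal(size=(4, n))
A = u1 + u4 + 0.5 * rng.normal(size=n)
B = u1 + u2 + 0.5 * rng.normal(size=n)
C = u2 + u3 + 0.5 * rng.normal(size=n)
D = u3 + u4 + 0.5 * rng.normal(size=n)

# Only the observed variables survive; the u_i are marginalized out.
corr = np.corrcoef([A, B, C, D])
print(round(corr[0, 1], 2))  # A, B share u1: clearly nonzero
print(round(corr[0, 2], 2))  # A, C share no latent: near zero
```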
which can indeed be written quite elegantly as an exponential family distribution with features for each clique in the
graph
I think you might be confusing undirected and bidirected graph models. The former form linear exponential families and can be parameterized via cliques, the latter form curved exponential families, and can be parameterized via connected sets.
Did you mean to say continuous bijections? Obviously adding two points wouldn’t change the cardinality of an infinite set, but “easy to compute” might change.
???
There are easy to compute bijections from R to [0,1], etc.
This is not true, there are bijections between R and (0,1), but not the closed interval.
Anyway, there are more striking examples: for example, if you know that A, B, C, D are in a discrete finite set, it restricts your choices quite a lot.
No.
Did you mean to say continuous bijections? Obviously adding two points wouldn’t change the cardinality of an infinite set, but “easy to compute” might change.
You’re right, I meant continuous bijections, as the context was a transformation of a probability distribution.
You are right, apologies.
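For the record, the settled point in code: a sketch of the logistic function as an easy-to-compute continuous bijection from R onto the open interval (0, 1), with its closed-form inverse (the logit):

```python
import math

def logistic(x):
    """Continuous bijection from R onto the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of the logistic, mapping (0, 1) back onto R."""
    return math.log(p / (1.0 - p))

for x in [-5.0, -0.1, 0.0, 2.5]:
    p = logistic(x)
    assert 0.0 < p < 1.0             # image lies strictly inside (0, 1)
    assert abs(logit(p) - x) < 1e-9  # round trip recovers x
print("ok")
```

The endpoints 0 and 1 are never hit, which is exactly why the claim holds for (0,1) but a bijection onto the closed interval [0,1] cannot be continuous.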