Oh, saying A,B,C,D are in [0,1] restricts quite a bit. It eliminates distributions with support over all the reals, distributions over R^n, distributions over words starting with the letter k, distributions over Turing machines, distributions over elm trees more than 4 years old in New Hampshire, distributions over bizarre mathematical objects that I can’t even think of… That’s a LOT of prior information. It’s a continuous space, so we can’t apply a maximum entropy argument directly to find our prior. Typically we use the beta prior for [0,1] due to a symmetry argument, but that admittedly is not appropriate in all cases. On the other hand, unless you can find dependencies after running the data through the continuous equivalent of a pseudo-random number generator, you are definitely utilizing SOME additional prior information (e.g. via smoothness assumptions). When the Bayesian formalism does not yield an answer, it’s usually because we don’t have enough prior info to rule out stuff like that.
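One way to make the symmetry argument concrete (I'm assuming the symmetry meant is the reflection of the interval, x ↦ 1 − x):

    % If X ~ Beta(alpha, beta), with density proportional to x^{alpha-1}(1-x)^{beta-1} on [0,1],
    X \sim \mathrm{Beta}(\alpha,\beta) \;\Longrightarrow\; 1 - X \sim \mathrm{Beta}(\beta,\alpha),
    % so invariance under the reflection x -> 1 - x within the Beta family forces
    \alpha = \beta,
    % with alpha = beta = 1 recovering the uniform distribution on [0,1].

Note that this still leaves alpha free, which is one sense in which the choice is admittedly not appropriate in all cases.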
I think we're still talking past each other about the distributions. The Bayesian approach to this problem uses a hierarchical distribution with two levels: one specifying the distribution p[A,B,C,D | X] in terms of some parameter vector X, and the other specifying the distribution p[X]. Perhaps the notation p[A,B,C,D ; X] is more familiar? Anyway, the hypothesis H1 corresponds to a subset of possible values of X. The beautiful distribution you talk about is p[A,B,C,D | X], which can indeed be written quite elegantly as an exponential family distribution with features for each clique in the graph. Under that parameterization, X would be the lambda vector specifying the exponential model. Unfortunately, p[X] is the ugly one, and that elegant parameterization for p[A,B,C,D | X] will probably make p[X] even uglier.
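A minimal sketch of that two-level structure, with binary A,B,C,D standing in for the [0,1]-valued variables, an arbitrarily chosen pairwise graph, and a placeholder normal p[X] (all of those choices are mine, just to make the shape of the hierarchy concrete):

    import numpy as np
    from itertools import product

    # Hypothetical binary stand-ins for A,B,C,D and an assumed set of cliques.
    names = ["A", "B", "C", "D"]
    cliques = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]

    def features(x):
        # One singleton feature per variable, one product feature per clique.
        vals = dict(zip(names, x))
        return np.array([vals[v] for v in names] +
                        [vals[i] * vals[j] for i, j in cliques], dtype=float)

    def p_given_x(lmbda):
        # p[A,B,C,D | X]: exponential family with natural parameter vector lmbda.
        states = list(product([0, 1], repeat=4))
        scores = np.array([features(s) @ lmbda for s in states])
        probs = np.exp(scores - scores.max())
        return states, probs / probs.sum()

    # p[X]: a placeholder standard-normal prior over the lambda vector.
    rng = np.random.default_rng(0)
    lmbda = rng.normal(size=4 + len(cliques))   # draw X ~ p[X]
    states, probs = p_given_x(lmbda)            # then A,B,C,D ~ p[. | X]
    for s, p in zip(states, probs):
        print(s, round(float(p), 4))

The point is only the shape: X sits at the top, p[A,B,C,D | X] is the tractable piece, and all the pain lives in p[X].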
It is much prettier for DAGs. In that case, we'd have one beta distribution for every possible set of inputs to each variable. X would then be the set of parameters for all those beta distributions. We'd get elegant generative models for numerical integration and life would be sunny and warm. So the simple use case for FCI is amenable to Bayesian methods. Latent variables are still a pain, though. They're fine in theory (just integrate over them when calculating the posterior), but it gets ugly fast.
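For what it's worth, here is a toy version of that DAG story with binary variables and a hypothetical chain A → B → C → D (both simplifications are mine): X is the collection of per-parent-configuration thetas, each with its own Beta(1,1) prior, and p[A,B,C,D | X] is ordinary ancestral sampling.

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(1)
    # Assumed DAG: A -> B -> C -> D, all variables binary.
    parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"]}

    # Draw X ~ p[X]: one Beta(1,1) draw per variable per parent configuration,
    # theta[v][pa_state] = P(v = 1 | parents in state pa_state).
    theta = {v: {pa: rng.beta(1.0, 1.0)
                 for pa in product([0, 1], repeat=len(pas))}
             for v, pas in parents.items()}

    def sample_joint(n):
        # p[A,B,C,D | X]: ancestral sampling through the DAG.
        draws = []
        for _ in range(n):
            vals = {}
            for v, pas in parents.items():
                pa_state = tuple(vals[p] for p in pas)
                vals[v] = int(rng.random() < theta[v][pa_state])
            draws.append(vals)
        return draws

    print(sample_joint(5))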
Oh, saying A,B,C,D are in [0,1] restricts quite a bit. It eliminates distributions with support over all the reals
???
There are easy to compute bijections from R to [0,1], etc.
The Bayesian approach to this problem uses a hierarchical distribution with two levels: one specifying the distribution p[A,B,C,D | X] in terms of some parameter vector X, and the other specifying the distribution p[X]
Yes, parametric Bayes does this. I am giving you a problem where you can't write down p(A,B,C,D | X) explicitly and then asking you to solve something frequentists are quite happy solving. Yes, I am aware I can do a prior for this in the discrete case. I am sure a paper will come of it eventually.
Latent variables are still a pain, though.
The whole point of things like the beautiful distribution is that you don't have to deal with latent variables. By the way, the reason to think about H1 is that it represents all the independences over A,B,C,D in this latent-variable DAG:
A ← u1 → B ← u2 → C ← u3 → D ← u4 → A
where we marginalize out the u_i variables.
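A quick simulation sketch of that margin (Gaussian latents and logistic links are arbitrary choices of mine; they also keep the observables inside (0,1), in keeping with the [0,1] setup): sample the u_i, push them through the structure above, and record only A,B,C,D.

    import numpy as np

    rng = np.random.default_rng(2)

    def sample_margin(n):
        # Latent parents u1..u4; each observed variable sees exactly two of them.
        u1, u2, u3, u4 = rng.normal(size=(4, n))
        noise = rng.normal(scale=0.1, size=(4, n))
        A = 1 / (1 + np.exp(-(u4 + u1 + noise[0])))   # A <- u4, u1
        B = 1 / (1 + np.exp(-(u1 + u2 + noise[1])))   # B <- u1, u2
        C = 1 / (1 + np.exp(-(u2 + u3 + noise[2])))   # C <- u2, u3
        D = 1 / (1 + np.exp(-(u3 + u4 + noise[3])))   # D <- u3, u4
        # The u_i are "marginalized out" simply by not being returned.
        return np.column_stack([A, B, C, D])

    data = sample_margin(10_000)
    # Non-adjacent pairs (A,C) and (B,D) share no latent parent, so their
    # correlations should be near zero; adjacent pairs should not be.
    print(np.corrcoef(data, rowvar=False).round(3))

The marginal independences here are A ⟂ C and B ⟂ D, which is the pattern H1 is said to represent above.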
which can indeed be written quite elegantly as an exponential family distribution with features for each clique in the graph
I think you might be confusing undirected and bidirected graph models. The former form linear exponential families and can be parameterized via cliques; the latter form curved exponential families and can be parameterized via connected sets.
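To make that contrast concrete, a rough sketch for binary variables (the bidirected side follows the connected-set / Moebius-style parameterization of Drton and Richardson, which I'm assuming is the one meant):

    % Undirected graph G: a linear exponential family, one natural parameter per clique,
    p(x) \;\propto\; \exp\Big( \sum_{C \in \mathcal{C}(G)} \lambda_C \prod_{v \in C} x_v \Big).
    % Bidirected graph G: free parameters only for connected sets S,
    q_S = P(X_S = \mathbf{0}) \quad \text{for } S \text{ connected in } G,
    % with q_D for a disconnected set D constrained to factor over its connected
    % components; that constraint is non-linear in the natural parameters, which
    % is why the model is a curved rather than a linear exponential family.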
This is not true: there are bijections between R and (0,1), but not the closed interval.
Anyway, there are more striking examples: for instance, if you know that A, B, C, D are in a discrete finite set, that restricts your choices quite a lot.
No.
Did you mean to say continuous bijections? Obviously adding two points wouldn’t change the cardinality of an infinite set, but “easy to compute” might change.
You’re right, I meant continuous bijections, as the context was a transformation of a probability distribution.
You are right, apologies.
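For anyone following along, a sketch of why both concessions are right: it is continuity that fails, not the existence of a bijection. The particular maps below are just one standard construction.

    % A continuous bijection from R onto the *open* interval:
    \sigma : \mathbb{R} \to (0,1), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}.
    % A (necessarily discontinuous) bijection absorbing the endpoints:
    f : [0,1] \to (0,1), \qquad f(0) = \tfrac12,\; f(1) = \tfrac13,\;
    f(\tfrac1n) = \tfrac1{n+2} \ (n \ge 2),\; f(x) = x \ \text{otherwise}.
    % Composing sigma^{-1} with f gives a bijection [0,1] -> R, but no continuous
    % bijection exists between them: [0,1] -> R fails because the continuous image
    % of a compact set is compact, and R -> [0,1] fails by an intermediate-value
    % plus injectivity argument.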