OP will correct me if I am wrong, but I think he is trying to restate the Robins/Wasserman example. You do not need to model f(X), but the point of that example is that you know f, but the conditional model for Y is very very complicated. So you either do a Bayesian approach with a prior and a likelihood for Y, or you just use Horvitz-Thompson with f.
I like to think of that example using causal inference: you want to estimate the causal effect p(Y | do(A)) of A on Y when the policy for assigning treatment A, p(A | C), is known exactly, but p(Y | A, C) is super complex. Likelihood-based methods, Bayesian ones included, will use \sum_C p(Y | A, C) p(C). But you can just look at \sum_{samples i} Y_i 1/p(A_i | C_i) to get the same thing and avoid modeling p(Y | A, C). But doing that isn’t Bayesian.
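To make the inverse-weighting idea concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the policy 0.2 + 0.6*C, the outcome model, and the function names are hypothetical and do not come from the Robins/Wasserman example. It only shows that weighting by the known p(A | C) recovers the interventional mean without ever fitting p(Y | A, C), while the raw treated-group mean does not.

```python
import random

def simulate(n, seed=0):
    """Draw (C, A, Y) rows under a known assignment policy p(A=1 | C)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = rng.random()                  # covariate C ~ Uniform(0, 1)
        p_treat = 0.2 + 0.6 * c           # known policy p(A=1 | C) (hypothetical)
        a = 1 if rng.random() < p_treat else 0
        # Hypothetical outcome model; the weighted estimator never looks at it.
        y = 2.0 * c + 1.0 * a + rng.gauss(0.0, 0.1)
        rows.append((c, a, y, p_treat))
    return rows

def ht_mean_treated(rows):
    """Horvitz-Thompson style estimate of E[Y | do(A=1)]:
    the sample average of Y * 1(A=1) / p(A=1 | C)."""
    return sum(y * a / p for (_, a, y, p) in rows) / len(rows)

def naive_mean_treated(rows):
    """Plain average of Y among the treated; ignores how treatment was assigned."""
    treated = [y for (_, a, y, _) in rows if a == 1]
    return sum(treated) / len(treated)

rows = simulate(200_000)
ht = ht_mean_treated(rows)        # close to the true value 2*E[C] + 1 = 2.0
naive = naive_mean_treated(rows)  # pulled upward, since treated units have larger C
print(f"HT: {ht:.3f}  naive: {naive:.3f}")
```

In this toy setup the weighted estimate lands near 2.0 and the naive one near 2.2, because treatment is more likely when C is large.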
My example is very similar to the Robins/Wasserman example, but you end up drawing different conclusions. Robins/Wasserman show that you can’t make sense of importance sampling in a Bayesian framework. My example shows that you can’t make sense of “conditional sampling” in a Bayesian framework. The goal of importance sampling is to estimate E[Y], while the goal of conditional sampling is to estimate E[Y|event] for some event.
We did talk about this before, that’s how I first learnt of the R/W example.
I think these are isomorphic: estimating E[Y] when Y is missing at random conditional on C is the same as estimating E[Y | do(a)] = E[Y | “we assign you to a given C”].
“Causal inference is a missing data problem, and missing data is a causal inference problem.”
Yes, I think you are missing something (although it is true that causal inference is a missing data problem).
It may be easier to think in terms of the potential outcomes model. Y0 is the outcome under no treatment and Y1 is the outcome under treatment; you only ever observe either Y0 or Y1, depending on whether D=0 or 1. Generally you are trying to estimate E[Y1] or E[Y0] or their difference.
The point is that the quantity Robins and Wasserman are trying to estimate, E[Y], does not depend on the importance sampling distribution, whereas the quantity I am trying to estimate, E[Y|f(X)], does depend on f. Changing f changes the population quantity to be estimated.
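A tiny sketch of that distinction, under my own reading of E[Y|f(X)] as E[Y | f(X) = 1] for an indicator f. The population, the outcome function and the two choices of f are all hypothetical. The overall mean E[Y] is one fixed number, but the conditional target moves when f changes.

```python
def conditional_mean(xs, y, f):
    """Mean of y(x) over the population members x that satisfy the event f(x)."""
    selected = [y(x) for x in xs if f(x)]
    return sum(selected) / len(selected)

# Hypothetical finite population: X on an even grid in [0, 1), with Y = X^2.
xs = [i / 1000 for i in range(1000)]
y = lambda x: x * x

overall = sum(y(x) for x in xs) / len(xs)            # E[Y], no f involved
high = conditional_mean(xs, y, lambda x: x >= 0.5)   # E[Y | f_a(X) = 1]
low = conditional_mean(xs, y, lambda x: x < 0.5)     # E[Y | f_b(X) = 1]

print(overall, high, low)
```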
It is true that sometimes people in causal inference are interested in estimating things like E[Y1 - Y0|D], e.g. the treatment effect on the treated. However, this is still different from my setup because D is a random variable, as opposed to an arbitrary function of the known variables like f(X).
Not following. By “importance sampling distribution” do you mean the distribution that tells you whether Y is missing or not? If so, changing this distribution will change what you have to do to estimate E[Y] in the Robins/Wasserman case. For example, if you change the distribution to just depend on an independent coin flip, you move from “MAR” to “MCAR” (in causal inference, from “conditional ignorability” to “ignorability”). So your procedure depends on this distribution (but your target does not, this is true). Similarly, “p(y | do(a))” does not change, but the functional of the observed data equal to “p(y | do(a))” will change if you change the treatment assignment distribution.
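A small simulation sketch of the MAR versus MCAR point. The data-generating process, the two missingness mechanisms and all names are hypothetical. The target E[Y] is 1.5 in both cases, but the procedure that recovers it is not the same: the complete-case average is fine under a coin flip and biased when missingness depends on C, while inverse weighting works in both cases using a different weight function each time.

```python
import random

def draw(n, mechanism, seed=1):
    """Simulate (Y, R, p_obs); Y is observed only when R == 1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        c = rng.random()                       # covariate C ~ Uniform(0, 1)
        y = 3.0 * c + rng.gauss(0.0, 0.1)      # so E[Y] = 1.5
        if mechanism == "MCAR":
            p_obs = 0.5                        # independent coin flip
        else:
            p_obs = 0.1 + 0.8 * c              # MAR: depends on C only
        r = 1 if rng.random() < p_obs else 0
        data.append((y, r, p_obs))
    return data

def complete_case_mean(data):
    """Average of the observed Y values, ignoring the missingness mechanism."""
    observed = [y for (y, r, _) in data if r == 1]
    return sum(observed) / len(observed)

def ipw_mean(data):
    """Inverse-probability-weighted mean using the known observation probability."""
    return sum(y * r / p for (y, r, p) in data) / len(data)

mcar = draw(200_000, "MCAR")
mar = draw(200_000, "MAR")
cc_mcar, ipw_mcar = complete_case_mean(mcar), ipw_mean(mcar)
cc_mar, ipw_mar = complete_case_mean(mar), ipw_mean(mar)
print(f"MCAR: cc={cc_mcar:.3f} ipw={ipw_mcar:.3f}")
print(f"MAR:  cc={cc_mar:.3f} ipw={ipw_mar:.3f}")
```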
(Btw, people do versions of ETT where D is complicated and not a simple treatment event. Actually I have something in a recent draft of mine called “effect of treatment on the indirectly treated” that’s like that).
By “importance sampling distribution” do you mean the distribution that tells you whether Y is missing or not?
Right. You could say the cases of Y1|D=1 you observe in the population are an importance sample from Y1, the hypothetical population that would result if everyone in the population were treated. E[Y1], the quantity to be estimated, is the mean of this hypothetical population. The importance sampling weights are q(x) = Pr[D=1|x]/p(x), where p(x) is the marginal distribution (i.e. you invert these weights to get the average), and the importance sampling distribution is the conditional density of X|D=1.
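For reference, one standard way to write out the inverse-weighting identity behind "invert these weights" is sketched below. The symbol e(x) is notation introduced here and is not used in the thread, and the derivation assumes consistency (Y = Y1 whenever D = 1), conditional ignorability (Y1 independent of D given X), and e(X) > 0.

```latex
% e(x) = \Pr[D = 1 \mid X = x] is notation introduced for this sketch only.
% Assumptions: Y = Y_1 when D = 1, Y_1 \perp D \mid X, and e(X) > 0.
\begin{align*}
E\!\left[\frac{D\,Y}{e(X)}\right]
  &= E\!\left[\frac{E[D\,Y_1 \mid X]}{e(X)}\right] \\
  &= E\!\left[\frac{e(X)\,E[Y_1 \mid X]}{e(X)}\right] \\
  &= E[Y_1].
\end{align*}
```

The first step uses D Y = D Y1 and the tower property, and the second uses conditional ignorability to factor E[D Y1 | X].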
I think Robins and Ritov have a theorem (cited in your blog link) claiming that to get E[Y] when Y is MAR, you need to incorporate info about 1/p(x) somewhere into your procedure (?the prior?) or you don’t get uniform consistency. Is your claim that you can get around this via some hierarchical model, e.g.:
How about a hierarchical model, where first we draw a parameter p from the uniform distribution, and
then draw g(x) from the uniform distribution over smooth functions with mean value equal to p? This
gets you non-constant g(x) in the posterior, while your posteriors of E[g(X)] converge to the truth as
quickly as in the Binomial example. Arguing backwards, I would say that such a prior comes closer to
capturing my beliefs.
Is this just intuition or did you write this up somewhere? That sounds very interesting.
Why did you start thinking about conditional sampling at all? If estimating E[Y] via importance sampling/inverse weights/covariate adjustment is already something of a difficulty for Bayesians, why think about E[Y | event]? Isn’t that trivially at least as hard?
The confusion may come from mixing up my setup and Robins/Ritov’s setup. There is no missing data in my setup.
I could write up my intuition for the hierarchical model. It’s an almost trivial result if you don’t assume smoothness, since for any x1,...,xn the parameters g(x1)...g(xn) are conditionally independent given p and distributed as F(p), where F is the maximum entropy Beta with mean p (I don’t know the form of the parameters alpha(p) and beta(p) off-hand). Smoothness makes the proof much more difficult, but based on high-dimensional intuition one can be sure that it won’t change the result substantially.
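A forward-simulation sketch of the non-smooth version of that hierarchical model, with loud caveats. The comment says it does not give alpha(p) and beta(p), so the choice Beta(k*p, k*(1-p)) with concentration k = 2 below is purely a placeholder assumption (it has mean p but is not claimed to be the maximum entropy Beta). The Bernoulli outcome layer is my reading of the "Binomial example". Given p, the g(x_i) are independent with mean p, so each Y_i is marginally Bernoulli(p), which is why the sample mean of Y tracks p.

```python
import random

def draw_hierarchy(n, concentration=2.0, seed=2):
    """Draw p ~ Uniform(0, 1), then g(x_i) ~ Beta with mean p, then Y_i ~ Bernoulli(g(x_i)).

    The Beta(k*p, k*(1-p)) parameterization is a placeholder assumption,
    not the maximum entropy Beta mentioned in the thread.
    """
    rng = random.Random(seed)
    p = rng.random()
    alpha = concentration * p
    beta = concentration * (1.0 - p)
    gs = [rng.betavariate(alpha, beta) for _ in range(n)]   # independent given p
    ys = [1 if rng.random() < g else 0 for g in gs]         # binary outcomes
    return p, gs, ys

p, gs, ys = draw_hierarchy(50_000)
mean_g = sum(gs) / len(gs)
mean_y = sum(ys) / len(ys)
print(f"p={p:.3f}  mean g={mean_g:.3f}  mean y={mean_y:.3f}")
```

This only simulates the prior forward; it does not compute a posterior, and it says nothing about the smooth case.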
It is quite possible that estimating E[Y] and E[Y|event] are “equivalently hard”, but they are both interesting problems with quite different real-world applications. The reason I chose to write about estimating E[Y|event] is that I think it is easier to explain than importance sampling.
See also this:
http://www.biostat.harvard.edu/robins/coda.pdf
I think we talked about this before.
Or I may be “missing” something. :)
Still slightly confused.