I think I’m beginning to see the problem for the Bayesian, although I am not yet sure what the correct response to it is. I have some more or less rambling thoughts about it.
It appears that the Bayesian is supposed to start from a flat prior over the space of all possible thetas. This is a very large space (all possible strings of 2^100000 probabilities), almost all of which consists of thetas that are independent of pi. (ETA: Here I mistakenly took X to be a product of two-point sets {0,1}, when in fact it is a product of unit intervals [0,1]. I don’t think this makes much difference to the argument, though; if it does, it would be best addressed by letting this comment stand as is and discussing that case separately.) When theta is independent of pi, it seems to me that the Bayesian would simply take the average of the sampled values of Y as an estimate of P(Y=1), and would be very likely to get almost the same value as the frequentist. Indirectly observing a few values of theta (through the observed values of Y) gives no information about any other values of theta, because the prior was flat. This is why the likelihood calculated in the blog post contains almost no information about theta.
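To make that last point concrete, here is a toy sketch (my own formalisation, not anything from the blog post) in which the flat prior is modelled as an independent uniform, i.e. Beta(1,1), prior on theta(x) at each point x separately:

```python
# Flat prior modelled as an independent Beta(1,1) prior on theta(x)
# at each point x. Observing Y at some x's updates the posterior only
# at those x's; everywhere else the posterior predictive stays at 1/2.
num_points = 10
alpha = [1.0] * num_points   # Beta parameters: one (alpha, beta) pair per point
beta = [1.0] * num_points

observed = {0: [1, 1, 1], 1: [1, 1]}   # Y values seen at points 0 and 1 only
for x, ys in observed.items():
    alpha[x] += sum(ys)            # number of Y = 1 seen at x
    beta[x] += len(ys) - sum(ys)   # number of Y = 0 seen at x

print([a / (a + b) for a, b in zip(alpha, beta)])
# -> [0.8, 0.75, 0.5, 0.5, ...]: the data at points 0 and 1 say nothing
#    about theta anywhere else, however much evidence accumulates there.
```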
Here is what seems to me to be a related problem. You will be presented with a series of some number of booleans, say 100. After each one, you are to guess the next. If your prior is a flat distribution over {0,1}^100, your prediction will be 50% each way at every stage, regardless of what the sequence so far has been, because all continuations are equally likely. It is impossible to learn from such a prior, which has built into it the belief that the past cannot predict the future.
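This is easy to verify by brute force for a sequence short enough to enumerate (16 bits here rather than 100, purely to keep the enumeration feasible):

```python
# Under a flat prior over {0,1}^N, every sequence has probability 2^-N,
# so the predictive probability of the next bit is 0.5 for ANY prefix.
from itertools import product

N = 16
prefix = (1, 1, 1, 1, 1, 1, 1, 1)  # eight 1s in a row; any prefix works

# All sequences consistent with the prefix, each with equal prior weight.
consistent = [s for s in product((0, 1), repeat=N) if s[:len(prefix)] == prefix]

# Fraction of consistent continuations whose next bit is 1.
print(sum(s[len(prefix)] for s in consistent) / len(consistent))  # -> 0.5
```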
As noted in the blog post, smoothness of theta with respect to e.g. the metric structure of {0,1}^100000 doesn’t help, because a sample of only 1000 from this space is overwhelmingly likely to consist of points that are all at a Manhattan distance of about 50000 from each other. No substantial extrapolation of theta is possible from such a sample unless it is smooth at the scale of the whole space.
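A quick simulation bears out the distance claim (with fewer than 1000 points, just to keep the pairwise computation cheap; the concentration is the same):

```python
# Pairwise Manhattan (= Hamming) distances among random points in
# {0,1}^100000 concentrate around d/2 = 50000, with a standard
# deviation of only sqrt(d)/2, about 158.
import numpy as np

rng = np.random.default_rng(0)
d, n = 100_000, 50
pts = rng.integers(0, 2, size=(n, d))

dists = [np.abs(pts[i] - pts[j]).sum()
         for i in range(n) for j in range(i + 1, n)]
print(min(dists), max(dists))  # both within a few hundred of 50000
```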
The flat prior over theta seems to be of a similar nature to the flat prior over sequences. If in this sample of 1000 you noticed that when pi was high, the corresponding value of Y, when sampled, was very likely to be 1, and that when pi was low, Y was usually 0 on those rare occasions when it was sampled, you might find it reasonable to conclude that pi and theta were related, and use something like the Horvitz-Thompson estimator. But the flat prior over theta does not allow this inference. However many values of theta you have gained some partial information about by sampling Y, they tell you nothing about any other values.
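For concreteness, here is roughly what the Horvitz-Thompson estimator buys you, on made-up data in which theta is deliberately made to track pi (and with pi known, as it is in the example):

```python
# Horvitz-Thompson on synthetic data where theta(x) = pi(x), the sort of
# dependence the flat prior rules out. pi is known, as in Robins-Ritov.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
pi = rng.uniform(0.05, 0.95, size=n)   # known sampling probabilities
theta = pi                             # worst case: theta tracks pi exactly
Y = rng.binomial(1, theta)             # outcomes
R = rng.binomial(1, pi)                # which outcomes we actually observe

# Weighting each observed Y by 1/pi corrects for the over-representation
# of high-pi points among the observed; the estimator is unbiased for E[Y].
ht_estimate = np.mean(R * Y / pi)

# Naively averaging the observed Y's is biased upward here, because
# Y = 1 is most common exactly where observation is most likely.
naive_estimate = Y[R == 1].mean()

print(ht_estimate, naive_estimate, theta.mean())  # HT lands near the truth
```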
My guess so far is that the problem lies with the flat prior over theta. The task for the Bayesian is to come up with a better prior, one capable of seeing a dependency between pi and theta.
Is the Robins and Ritov paper the one cited in the blog post, “Toward a Curse of Dimensionality Appropriate (CODA) Asymptotic Theory for Semi-parametric Models”? I looked at that briefly, only enough to see that their example, though somewhat similar, deals with a relatively low-dimensional case (dimension 5, which in practical terms still counts as high-dimensional) and what they describe as a “moderate” sample size of 10000. So that’s rather different from the present example, and I don’t know whether anything I just said will be relevant to it.
On reading further in the blog post, I see that a lot of what I said is said more briefly in the comments there, especially comment (4) by Chris Sims:
If theta and pi were independent, we could just throw out the observations where we don’t see Y and use the remaining sample as if there were no “R” variable. So specifying that theta and pi are independent is not a reasonable way to say we have little knowledge. It amounts to saying we are sure the main potential complication in the model is not present, and therefore opens us up to making seriously incorrect inference.
And a flat prior on theta is an assumption that theta and pi are almost certainly independent.
The right way out is to have a “weird” prior that mirrors frequentist behavior. Which, as the authors point out, is perfectly fine, but why bother? By the way, Bayes can’t use Horvitz-Thompson directly because it’s not a likelihood-based estimator; I think you have to somehow bake the entire thing into the prior.
The insight that lets you structure your Bayesian setup properly here is sort of coming from outside the problem, too.
A note on notation: [0,1] with square brackets generally refers to the closed interval between 0 and 1. X is a continuous variable, not a boolean one.
Actually, I should have been using curly brackets, as when I wrote (0,1) I meant the set with two elements, 0 and 1, which is what I had taken X to be a product of copies of, hence my obtaining 50000 as the expected Manhattan distance between any two members. I’ll correct the post to make that clear. I think everything I said would still apply to the continuous case. If it doesn’t, that would be better addressed with a separate comment.
Yes, the CODA paper is what I meant.
Yeah, I don’t think it makes much difference in high dimensions. It’s just more natural to talk about smoothness in the continuous case.