It will take a while to understand it, but by the end of section 3 I was wondering when the assumption that X is a binary string was going to be used. Not at all, so far. The space might as well have been defined as just a set of 2^d arbitrary things. So I anticipate that introducing a smoothness assumption on theta, foreshadowed at this point, won’t help: there is no structure for theta to be smooth with respect to. Surely this is why the only information about X that can be used to estimate Y is π(X)? That is the only information about X that is available, the way the problem is set up.

More when I’ve studied the rest.
The binary thing isn’t important. What’s important is that there are real situations where likelihood-based methods (including Bayes) don’t work well, because by assumption the only strong information is in the part of the likelihood we aren’t using in our functional, and the part of the likelihood we are using in our functional is very complicated.
I think my point wasn’t so much the technical specifics of that example, but rather that these are the types of B vs F arguments that actually have something to say, rather than going around and around in circles. I had a rephrasing of this example using causal language somewhere on LW (not sure if that will help).
Robins and Ritov have something of paper length, rather than blog-post length, if you are interested.
Wait, IlyaShpitser, I think you overestimate my knowledge of the field of statistics. From what it sounds like, there’s an actual, quantitative difference between Bayesian and frequentist methods: in a given situation, the two will come to totally different results. Is this true?
I should have made it clearer that I don’t care about some abstract philosophical difference if said difference doesn’t lead to different results (because those differences usually come down to a nonsensical distinction, à la free will). I was under the impression that there is a claim that some interpretation of the philosophy will yield different results, but I was missing it, because everything I’ve been introduced to seems to give the same answer.
Is it true that they’re different methods that actually give different answers?
I think it’s more that there are times when frequentists claim there isn’t an answer. It’s very common for statistical tests to talk about likelihood. The likelihood of a hypothesis given an experimental result is defined as the probability of the result given the hypothesis. If you want to know the probability of the hypothesis, you take the likelihood, multiply it by the prior probability, and normalize. Frequentists deny that there always is a prior probability. As a result, they tend to just use the likelihood as if it were the probability of the hypothesis. Conflating the two is equivalent to the base rate fallacy.

EY believes so.
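The likelihood-versus-posterior distinction can be made concrete with a small numerical sketch. The test characteristics and base rate below are made-up numbers for illustration only:

```python
# Hypothetical numbers: a test that is positive 99% of the time when the
# hypothesis is true, 5% of the time when it is false, and a 1% base rate.
p_h = 0.01            # prior probability of the hypothesis (base rate)
p_e_given_h = 0.99    # likelihood: P(evidence | hypothesis)
p_e_given_not_h = 0.05

# Total probability of observing the evidence.
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Posterior via Bayes' rule: likelihood times prior, normalized.
p_h_given_e = p_e_given_h * p_h / p_e

print(f"likelihood P(E|H) = {p_e_given_h:.3f}")
print(f"posterior  P(H|E) = {p_h_given_e:.3f}")
# The likelihood is 0.99, but the posterior is only about 0.17: treating
# the likelihood as the probability of H is the base rate fallacy.
```

With these numbers the posterior is exactly 1/6, despite the likelihood being 0.99, which is the gap the base rate fallacy ignores.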
I think I’m beginning to see the problem for the Bayesian, although I am not yet sure what the correct response to it is. I have some more or less rambling thoughts about it.
It appears that the Bayesian is supposed to start from a flat prior over the space of all possible thetas. This is a very large space (all possible assignments of a probability to each of the 2^100000 points), almost all of which consists of thetas that are independent of pi. (ETA: Here I mistakenly took X to be a product of two-point sets {0,1}, when in fact it is a product of unit intervals [0,1]. I don’t think this makes much difference to the argument; if it does, it would be best addressed by letting this comment stand as is and discussing that case separately.) When theta is independent of pi, it seems to me that the Bayesian would simply take the average of the sampled values of Y as an estimate of P(Y=1), and be very likely to get almost the same value as the frequentist. Indirectly observing a few values of theta (through the observed values of Y) gives no information about any other values of theta, because the prior was flat. This is why the likelihood calculated in the blog post contains almost no information about theta.
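One way to see the “no information about any other values” point: under an independent flat Beta(1,1) prior on theta(x) at each point x, the posterior at any point never sampled is untouched by the data. A minimal sketch, with a toy stand-in for the setup (1000 points instead of 2^100000, and invented sample sizes):

```python
import random

random.seed(0)
n_points = 1000      # toy stand-in for the 2**100000 points in the example
n_draws = 50         # draws of Y at each sampled point
true_theta = [random.random() for _ in range(n_points)]

# Observe Y at just 10 of the 1000 points.
sampled = random.sample(range(n_points), 10)
successes = {i: sum(random.random() < true_theta[i] for _ in range(n_draws))
             for i in sampled}

# With an independent flat Beta(1,1) prior at each point, the posterior at a
# sampled point is Beta(1 + s, 1 + n_draws - s); at an unsampled point it is
# still Beta(1,1), no matter what was seen elsewhere.
def posterior_mean(x):
    if x in successes:
        return (1 + successes[x]) / (2 + n_draws)
    return 0.5

unsampled = next(i for i in range(n_points) if i not in successes)
print(posterior_mean(sampled[0]))   # moved by the data
print(posterior_mean(unsampled))    # still exactly 0.5
```

The independence built into the prior means the posterior factorizes point by point, so nothing learned at one point transfers to any other.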
Here is what seems to me to be a related problem. You will be presented with a series of booleans, say 100 of them. After each one, you are to guess the next. If your prior is a flat distribution over {0,1}^100, your prediction will be 50% each way at every stage, regardless of what the sequence so far has been, because all continuations are equally likely. It is impossible to learn from such a prior, which has built into it the belief that the past cannot predict the future.
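This can be checked directly at a smaller sequence length, say 16 instead of 100, where the full enumeration is feasible; the argument is identical for 100:

```python
from itertools import product

n = 16  # small enough to enumerate all 2**16 sequences
all_seqs = list(product([0, 1], repeat=n))

def predictive(prefix):
    """P(next bit = 1 | observed prefix), under a uniform prior on {0,1}^n."""
    consistent = [s for s in all_seqs if s[:len(prefix)] == tuple(prefix)]
    return sum(s[len(prefix)] for s in consistent) / len(consistent)

# No matter how regular the prefix looks, the predictive stays at 0.5,
# because every continuation is equally weighted by the prior.
print(predictive([1] * 10))      # ten 1s in a row
print(predictive([0, 1] * 5))    # strict alternation
```

Half of the sequences consistent with any prefix continue with a 1, so the predictive probability is 0.5 after every possible history.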
As noted in the blog post, smoothness of theta with respect to e.g. the metric structure of {0,1}^100000 doesn’t help, because a sample of only 1000 from this space is overwhelmingly likely to consist of points that are all at a Manhattan distance of about 50000 from each other. No substantial extrapolation of theta is possible from such a sample unless it is smooth at the scale of the whole space.
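The distance claim is easy to check by simulation at a smaller scale, here d = 1000 and 100 sample points rather than 100000 and 1000; the same concentration at d/2 appears:

```python
import random

random.seed(0)
d, n = 1000, 100  # smaller than the example's 100000 and 1000
points = [[random.randint(0, 1) for _ in range(d)] for _ in range(n)]

def hamming(a, b):
    # Manhattan distance on {0,1}^d is the Hamming distance.
    return sum(x != y for x, y in zip(a, b))

dists = [hamming(points[i], points[j])
         for i in range(n) for j in range(i + 1, n)]
print(min(dists), sum(dists) / len(dists), max(dists))
# All pairwise distances cluster tightly around d/2 = 500: even the
# "nearest" neighbour is essentially as far away as a random point.
```

The per-coordinate mismatches are independent coin flips, so pairwise distances are Binomial(d, 1/2) and concentrate at d/2 with spread of order sqrt(d), which is negligible relative to d.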
The flat prior over theta seems to be of a similar nature to the flat prior over sequences. If in this sample of 1000 you noticed that when pi was high, the corresponding value of Y, when sampled, was very likely to be 1, and similarly that when pi was low, Y was usually 0 on those rare occasions when it was sampled, you might find it reasonable to conclude that pi and theta were related and use something like the Horvitz-Thompson estimator. But the flat prior over theta does not allow this inference. However many values of theta you have gained partial information about by sampling Y, they tell you nothing about any other values.
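For concreteness, here is the kind of inference the Horvitz-Thompson estimator exploits. This is a toy version of the setup, with pi and theta deliberately made maximally dependent: the complete-case average of the observed Y is badly biased, while weighting each observed Y by 1/pi is not:

```python
import random

random.seed(0)
n = 100_000

# Toy setup: the known sampling probability pi and theta = P(Y=1) are
# deliberately correlated (here theta = pi, the strongest dependence).
naive_num, naive_den, ht_sum, true_mean = 0, 0, 0.0, 0.0
for _ in range(n):
    pi = random.uniform(0.01, 1.0)
    theta = pi
    true_mean += theta / n
    y = 1 if random.random() < theta else 0
    r = 1 if random.random() < pi else 0   # Y is observed only when R = 1
    if r:
        naive_num += y
        naive_den += 1
    ht_sum += r * y / pi   # Horvitz-Thompson term (zero when unobserved)

naive = naive_num / naive_den   # complete-case average: biased upward
ht = ht_sum / n                 # Horvitz-Thompson: unbiased for E[Y]
print(f"true E[Y] ~ {true_mean:.3f}, naive {naive:.3f}, HT {ht:.3f}")
```

The naive average over-represents high-pi (hence high-theta) units, while dividing each observed Y by its sampling probability exactly undoes the selection, using only pi and not the likelihood of theta.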
My guess so far is that this is a problem with the flat prior over theta. The task for the Bayesian is to come up with a better one, capable of seeing a dependency between pi and theta.
Is the Robins and Ritov paper the one cited in the blog post, “Toward a Curse of Dimensionality Appropriate (CODA) Asymptotic Theory for Semi-parametric Models”? I looked at it only briefly, enough to see that their example, though somewhat similar, deals with a relatively low-dimensional case (5 dimensions, which in practical terms still counts as high-dimensional) and what they describe as a “moderate” sample size of 10000. So it is rather different from the present example, and I don’t know whether anything I just said will be relevant to it.
On reading further in the blog post, I see that a lot of what I said is said more briefly in the comments there, especially comment (4) by Chris Sims:
If theta and pi were independent, we could just throw out the observations where we don’t see Y and use the remaining sample as if there were no “R” variable. So specifying that theta and pi are independent is not a reasonable way to say we have little knowledge. It amounts to saying we are sure the main potential complication in the model is not present, and therefore opens us up to making seriously incorrect inference.
And a flat prior on theta is an assumption that theta and pi are almost certainly independent.

Yes, the CODA paper is what I meant.
The right way out is to have a “weird” prior that mirrors frequentist behavior. Which, as the authors point out, is perfectly fine, but why bother? By the way, Bayes can’t use Horvitz-Thompson directly, because it’s not a likelihood-based estimator; I think you have to somehow bake the entire thing into the prior.
The insight that lets you structure your B setup properly here comes from “outside the problem,” too.
A note on notation: [0,1] with square brackets generally refers to the closed interval between 0 and 1. X is a continuous variable, not a boolean one.
Actually, I should have been using curly brackets: when I wrote (0,1) I meant the set with two elements, 0 and 1, which is what I had taken X to be a product of copies of, hence my obtaining 50000 as the expected Manhattan distance between any two members. I’ll correct the post to make that clear. I think everything I said would still apply to the continuous case; if it doesn’t, that would be better addressed in a separate comment.

Yeah, I don’t think it makes much difference in high dimensions. It’s just more natural to talk about smoothness in the continuous case.