I don’t think this is right. In the R/W example they are interested in some number. Statisticians are always interested in some number or other! A frequentist will put an interval around this number with some properties. A Bayesian will try to construct a setup where the posterior ends up concentrating around this number. The point is, it takes a Bayesian (who ignores relevant info) forever to get there, while it does not take the frequentist forever. Getting to the right answer in a reasonable number of samples is not a specifically frequentist demand; it’s a standard demand we place on statistical inference!
What’s going wrong here for Bayesians is that they are either ignoring information (which is always silly) or using an extremely unnatural setup to avoid ignoring it. Frequentists are quite content to exploit information outside the likelihood; Bayesians are forbidden from doing so by their framework (except in the prior, of course).
Ah, and now I think I see what’s going on!
I don’t think this example is adversarial (in the sense of somewhat artificial constructions people do to screw up a particular algorithm). This is a very natural problem that comes up constantly. You don’t have to carefully pick your assignment probability to screw up the Bayesian, either; almost any such probability would work in this example (unless it’s an independent coin flip, in which case R/W point out that Bayesians have a good solution).
In fact, I could give you an infinite family of such examples, if you wanted, by just converting causal inference problems into the R/W setup where lots of info lives outside the likelihood function.
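To make “info outside the likelihood” concrete, here is a minimal simulation sketch. It assumes the usual reading of the R/W setup: X is uniform, the assignment probability pi(X) is known, R ~ Bernoulli(pi(X)) says whether Y is observed, Y ~ Bernoulli(theta(X)), and the target number is psi = E[theta(X)]. The particular `pi` and `theta` below are made up for illustration (and are far smoother than the non-parametric theta in the actual example). The inverse-probability-weighted (Horvitz-Thompson) estimator is, as far as I can tell, the frequentist estimator being referred to. The “naive” complete-case mean is only a stand-in for a procedure that ignores pi, not a Bayesian posterior.

```python
import random


def simulate(n, seed=0):
    """Return (ht_estimate, naive_estimate, sample_truth) for psi = E[theta(X)].

    pi(x) and theta(x) are hypothetical choices, picked so that theta is
    low exactly where observation is likely.
    """
    rng = random.Random(seed)
    ht_sum = 0.0       # running sum of R * Y / pi(X)
    obs_sum = 0.0      # running sum of Y over observed units only
    obs_count = 0      # number of observed units (R = 1)
    theta_sum = 0.0    # running sum of theta(X), for reference only

    for _ in range(n):
        x = rng.random()
        pi = 0.1 + 0.8 * x      # known assignment probability (assumed form)
        theta = 0.9 - 0.8 * x   # unknown outcome probability (assumed form)
        theta_sum += theta

        if rng.random() < pi:   # R = 1: the outcome Y is observed
            y = 1.0 if rng.random() < theta else 0.0
            ht_sum += y / pi    # weight by the known pi
            obs_sum += y
            obs_count += 1

    ht = ht_sum / n             # Horvitz-Thompson: uses pi, no model for theta
    naive = obs_sum / obs_count  # complete-case mean: ignores pi entirely
    truth = theta_sum / n
    return ht, naive, truth


if __name__ == "__main__":
    ht, naive, truth = simulate(200_000, seed=1)
    print(f"truth ~ {truth:.3f}  HT ~ {ht:.3f}  naive ~ {naive:.3f}")
```

With these choices psi is 0.5. The weighted estimator lands near it without knowing anything about theta, while the estimator that ignores pi settles on a different number because theta and pi are correlated.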
You can’t really say “oh I believe in the likelihood principle,” and then rule out examples where the principle fails as unnatural or adversarial. Maybe the principle isn’t so good.
I don’t understand at all this business with “logical omniscience” and how it’s supposed to save you.
If the Bayesian’s ignoring information, then you gave them the wrong prior. As far as I can tell, the objection is that the prior over theta which doesn’t ignore the information depends on pi, and intuition says that a Bayesian should treat pi as independent of theta. But if theta can be chosen in response to pi, then the Bayesian prior over theta had better depend on pi.
I wasn’t saying that this problem is “adversarial” in the “you’re punishing Bayesians therefore I don’t have to win” way; I agree that that would be a completely invalid argument. I was saying “if you want me to succeed even when theta is chosen by someone who doesn’t like me after pi is chosen, I need a prior over theta which depends on pi.” Then everything works out, except that Robins and Wasserman complain that this is torturing Bayesianism to give a frequentist answer. To that, I shrug. If you want me to get the frequentist result (“no matter which theta you pick I converge”), then the result will look frequentist. Not much surprise there.
This is a very natural problem that comes up constantly.
You realize that the Bayesian gets the right answer way faster than the frequentist in situations where theta is discrete, or sufficiently smooth, or parametric, right? I doubt you find problems like this where theta is non-parametric and utterly discontinuous “naturally” or “constantly”. But even if you do, the Bayesian will still succeed with a prior over theta that is independent of pi, except when pi is so complicated, and theta is so discontinuous and so precisely tailored to hiding information in the places that pi makes very difficult to observe, that the only way you can learn theta is by knowing that it’s been tailored to that particular pi. (The frequentist is essentially always assuming that theta is tailored to pi in this way, because they’re acting as though theta might have been selected by an adversary, which is what you do if you want to converge in all cases.) And even in that case the Bayesian can succeed by putting a prior on theta that depends on pi. What’s the problem?
Imagine there’s a game where the two of us will both toss an infinite number of uncorrelated fair coins, and then check which real numbers are encoded by these infinite bit sequences. Using any sane prior, I’ll assign measure zero to the event “we got the same real number.” If you’re then like “Aha! But what if my coin actually always returns the same result as yours?” then I’m going to shrug and use a prior which assigns some non-zero probability to a correlation between our coins.
Robins and Wasserman’s game is similar. We’re imagining a non-parametric theta that’s very difficult to learn about, which is like the first infinite coin sequence (and their example does require that it encode infinite information). Then we also imagine that there’s some function pi which makes certain places easier or harder to learn about, which is like the second coin sequence. Robins and Wasserman claim, roughly, that for some finite set of observations and sufficiently complicated pi, a reasonable Bayesian will place ~zero probability on theta just happening to hide all its terrible discontinuities in that pi, in just such a way that the only way you can learn theta is by knowing that it is one of the thetas that hides its information in that particular pi. This would be like the coin sequences coinciding. Fine, I agree that under sane priors and for sufficiently complex functions pi, that event has measure zero: if theta is as unstructured as you say, it would take an infinite confluence of coincident events to make it one of the thetas that happens to hide all its important information in exactly the places where this particular pi makes it impossible to learn.
If you then say “Aha! Now I’m going to score you by your performance against precisely those thetas that hide in that pi!” then I’m going to shrug and require a prior which assigns some non-zero probability to theta being one of the thetas that hides its info in pi.
That normally wouldn’t require any surgery to the intuitive prior (I place positive but small probability on any finite pair of sequences of coin tosses being identical), but if we’re assuming that it actually takes an infinite confluence of coincident events for theta to hide its info in pi and you still want to measure me against thetas that do this, then yeah, I’m going to need a prior over theta that depends on pi. You can cry “that’s violating the spirit of Bayes” all you want, but it still works.
And in the real world, I do want a prior which can eventually say “huh, our supposedly independent coins have come up the same way 2^trillion times, I wonder if they’re actually correlated?” or which can eventually say “huh, this theta sure seems to be hiding lots of very important information in the places that pi makes it super hard to observe, I wonder if they’re actually correlated?” so I’m quite happy to assign some (possibly very tiny) non-zero prior probability on a correlation between the two of them. Overall, I don’t find this problem perturbing.
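The coin version of this can be worked out exactly. The sketch below assumes a deliberately crude two-hypothesis model that is not in the discussion above: either the two coins are independent and fair, or they are perfectly correlated. The function name `posterior_correlated` is made up for illustration. It computes the posterior probability of correlation after the first n tosses all match.

```python
from fractions import Fraction


def posterior_correlated(prior, n_matches):
    """P(coins are perfectly correlated | first n_matches tosses all agreed).

    Two-hypothesis toy model (an assumption of this sketch):
      - correlated: the coins always agree, so the likelihood is 1
      - independent fair coins: each toss agrees with probability 1/2
    """
    prior = Fraction(prior)
    like_corr = Fraction(1)
    like_indep = Fraction(1, 2) ** n_matches
    evidence = prior * like_corr + (1 - prior) * like_indep
    return prior * like_corr / evidence


if __name__ == "__main__":
    tiny = Fraction(1, 10**6)
    for n in (0, 20, 40):
        print(n, float(posterior_correlated(tiny, n)))
    # A prior of exactly zero never moves, no matter how many matches:
    print(float(posterior_correlated(0, 1000)))
```

Under this toy model, a one-in-a-million prior on correlation passes even odds after about twenty matching tosses, while a prior of exactly zero stays at zero forever. That is the reason for wanting some non-zero prior mass on the correlation.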
You can’t really say “oh I believe in the likelihood principle,” and then rule out examples where the principle fails as unnatural or adversarial.
Sure, as long as you shrug and do what works, we have nothing to discuss :).
I do agree that the insight that makes this go through is basically Frequentist, regardless of setup. All the magic happened in the prior before you started.
I agree completely!