So8res comments on The trouble with Bayes (draft)

So8res 24 Oct 2015 2:42 UTC
13 points
As for the Robins / Wasserman example, here’s my initial thoughts. I’m not entirely sure I’m understanding their objection correctly, but at a first pass, nothing seems amiss. I’ll start by gameifying their situation, which helps me understand it better. Their situation seems to work as follows: Imagine an island with a d-dimensional surface (set d=2 for easy visualization). Anywhere along the island, we can dig for treasure, but only if that point on the island is unoccupied. At the beginning of the game, all points on the island are occupied. But people sometimes leave the points with uniform probability, in which case the point can be acquired and whoever acquires it can dig for treasure at that point. (The Xi variables on the blog are points on the island that become unoccupied during the game; we assume this is a uniformly random process.)

We’re considering investing in a given treasure-digging company that’s going to acquire land and dig on this island. At each point on the island, there is some probability of it having treasure. What we want to know, so that we can decide whether to invest, is how much treasure is on the island. We will first observe the treasure company acquire n points of land and dig there, and then we will decide whether to invest. (The Yi variables are the probability of treasure at the corresponding Xi. There is some function theta(x) which determines the probability of treasure at x. We want to estimate the unconditional probability that there is treasure anywhere on the island, this is psi, which is the integral of theta(x) dx.)

However, the company tries to hide facts about whether or not they actually struck treasure. What we do is, we hire a spy firm. Spies aren’t perfect, though, and some points are harder to spy on than others (if they’re out in the open, or have little cover, etc.) For each point on the island, there is some probability of the spies succeeding at observing the treasure diggers. We, fortunately, know exactly how likely the spies are to succeed at any given point. If the spies succeed in their observation, they tell us for sure whether the diggers found treasure. (The successes of the spies are the Ri variables. pi(x) is the probability of successfully spying at point x.)

To summarize, we have three series of variables Xi, Yi, and Ri. All are i.i.d. Yi and Ri are conditionally independent given Xi. The Xi are uniformly distributed. There is some function theta(x) which tells us how likely the there is to be treasure at any given point, and there’s some other function pi(x) which tells us how likely the spies are to successfully observe x. Our task is to estimate psi, the probability of treasure at any random point on the island, which is the integral of theta(x) dx.

The game works as follows: n points x1..xn open on the island, and we observe that those points were acquired by the treasure diggers, and for some of them we send out our spy agency to maybe learn theta(xi). Robins and Wasserman argue something like the following (afaict):

“You observe finitely many instances of theta(x). But the surface of the island is continuous and huge! You’ve observed a teeny tiny fraction of Y-probabilities at certain points, and you have no idea how theta varies across the space, so you’ve basically gained zero information about theta and therefore psi.”

To which I say: Depends on your prior over theta. If you assume that theta can vary wildly across the space, then observing only finitely many theta(xi) tells you almost nothing about theta in general, to be sure. In that case, you learn almost nothing by observing finitely many points—nor should you! If instead you assume that the theta(xi) do give you lots of evidence about theta in general, then you’ll end up with quite a good estimate of psi. If your prior has you somewhere in between, then you’ll end up with an estimate of psi that’s somewhere in between, as you should. The function pi doesn’t factor in at all unless you have reason to believe that pi and theta are correlated (e.g. it’s easier to spy on points that don’t have treasure, or something), but Robins and Wasserman state explicitly that they don’t want to consider those scenarios. (And I’m fine with assuming that pi and theta are uncorrelated.)

(The frequentist approach takes pi into account anyway and ends up eventually concentrating its probability mass mostly around one point psi in the space of possible psi values, causing me to frown very suspiciously, because we were assuming that pi doesn’t tell us anything about psi.)

Robins and Wasserman then argue that the frequentist approach gives the following guarantee: No matter what function theta(x) determines the probability of treasure at x, they only need to observe finitely many points before their estimate for psi is “close” to the true psi (which they define formally). They argue that Bayesians have a very hard time generating a prior that has this property. (They note that it is possible to construct a prior that yields an estimate similar to the frequentist estimate, but that this requires torturing the prior until it gives a frequentist answer, at which point, why not just become a frequentist?)

I say, sure, it’s hard (though not impossible) for a Bayesian to get that sort of guarantee. But nothing is amiss here! Two points:

(a) They claim that it’s disconcerting that the theta(xi) don’t give a Bayesian much information about theta. They admit that there are priors on theta that allow you to get information about theta from finitely many theta(xi), but protest that these theta are pretty weird (“very very very smooth”) if the dimensionality d of the island is very high. In which case I say, if you think that the theta(xi) can’t tell you much about theta, then you shouldn’t be learning about theta when you learn about the various theta(xi)! In fact, I’m suspicious of anyone who says they can, under these assumptions.

Also, I’m not completely convinced that “the observations are uninformative about theta” implies “the observations are uninformative about psi”—I acknowledge that from theta you can compute psi, and thus in some sense theta is the “only unknown,” but I think you might be able to construct a prior where you learn little about theta but lots about psi. (Maybe the i.i.d. assumption rules this possibility out? I’m not sure yet, I haven’t done the math.) But assume we either don’t have any way of getting information about psi except by integrating theta, or that we don’t have a way of doing it except one that looks “tortured” (because otherwise their argument falls through anyway). That brings us to my second point:

(b) They ask for the property that, no matter what theta is the true theta, you, after only finitely many trials, assign very high probability to the true value of psi. That’s a crazy demand! What if the true theta is one where learning finitely many theta(xi) doesn’t give you any information about theta? If we have a theta such that my observations are telling me nothing about it, then I don’t want to be slowly concentrating all my probability mass on one particular value of psi; that would be mad. (Unless the observations are giving me information about psi via some mechanism other than information about theta, which we’re assuming is not the case.)

If the game is really working like they say it is, then the frequentist is often concentrating probability around some random psi for no good reason, and when we actually draw random thetas and check who predicted better, we’ll see that they actually converged around completely the wrong values. Thus, I doubt the claim that, setting up the game exactly as given, the frequentist converges on the “true” value of psi. If we assume the frequentist does converge on the right answer, then I strongly suspect either (1) we should be using a prior where the observations are informative about psi even if they aren’t informative about theta or (2) they’re making an assumption that amounts to forcing us to use the “tortured” prior. I wouldn’t be too surprised by (2), given that their demand on the posterior is a very frequentist demand, and so asserting that it’s possible to zero in on the true psi using this data in finitely many steps for any theta may very well amount to asserting that the prior is the tortured one that forces a frequentist-looking calculation. They don’t describe the “tortured prior” in the blog post, so I’m not sure what else to say here ¯\_(ツ)_/¯

There are definitely some parts of the argument I’m not following. For example, they claim that for simple functions pi, the Bayesian solution obviously works, but there’s no single prior on theta which works for any pi no matter how complex. I’m very suspicious about this, and I wonder whether they mean is there’s no sane prior which works for any pi, and that that’s the place they’re slipping the “but you can’t be logically omniscient!” objection in, at which point yes, Bayesian reasoning is not the right tool. Unfortunately, I don’t have any more time to spend digging at this problem. By and large, though, my conclusion is this:

If you set the game up as stated, and the observations are actually giving literally zero data about psi, then I will be sticking to my prior on psi, thankyouverymuch. If a frequentist assumes they can use pi to update and zooms off in one direction or another, then they will be wrong most of the time. If you also say the frequentist is performing well then I deny that the observations were giving no info. (By the time they’ve converged, the Bayesian must also have data on theta, or at least psi.) If it’s possible to zero in on the true value of psi after finitely many observations, then I’m going to have to use a prior that allows me to do so, regardless of whether or not it appears tortured to you :-)

(Thanks to Benya for helping me figure out what the heck was going on here.)
- snarles 24 Oct 2015 14:43 UTC
  3 points
  Parent
  
  If the game is really working like they say it is, then the frequentist is often concentrating probability around some random psi for no good reason, and when we actually draw random thetas and check who predicted better, we’ll see that they actually converged around completely the wrong values. Thus, I doubt the claim that, setting up the game exactly as given, the frequentist converges on the “true” value of psi. If we assume the frequentist does converge on the right answer, then I strongly suspect either (1) we should be using a prior where the observations are informative about psi even if they aren’t informative about theta or (2) they’re making an assumption that amounts to forcing us to use the “tortured” prior. I wouldn’t be too surprised by (2),
  
  The frequentist result does converge, and it is possible to make up a very artificial prior which allows you to converge to psi. But the fact that you can make up a prior that gives you the frequentist answer is not surprising.
  
  A useful perspective is this: there are no Bayesian methods, and there are no frequentist methods. However, there are Bayesian justifications for methods (“it does well based in the average case”) and frequentist justifications (“it does well asymptotically or in a minimax sense”) for methods. If you construct a prior in order to converge to psi asymptotically, then you may be formally using Bayesian machinery, but the justification you could possibly give for your method is completely frequentist.
  - So8res 24 Oct 2015 16:37 UTC
    7 points
    Parent
    I understand the “no methods only justifications” view, but it’s much less comforting when you need to ultimately build a reliable reasoning system :-)
    
    I remain mostly unperturbed by this game. You made a very frequentist demand. From a Bayesian perspective, your demand is quite a strange one. If you force me to achieve it, then yeah, I may end up doing frequentist-looking things.
    
    In attempts to steel-man the Robins/Wasserman position, it seems the place I’m supposed to be perturbed is that I can’t even achieve the frequentist result unless I’m willing to make my prior for theta depend on pi, which seems to violate the spirit of Bayesian inference?
    
    Ah, and now I think I see what’s going on! The game that corresponds to a Bayesian desire for this frequentist property is not the game listed; it’s the variant where theta is chosen adversarially by someone who doesn’t want you to end up with a good estimate for psi. (Then the Bayesian wants a guarantee that they’ll converge for every theta.) But those are precisely the situations where the Bayesian shouldn’t be ignoring pi; the adversary will hide as much contrary data as they can in places that are super-difficult for the spies to observe.
    
    Robins and Wasserman say “once a subjective Bayesian queries the randomizer (who selected pi) about the randomizer’s reasoned opinions concerning theta (but not pi) the Bayesian will have independent priors.” They didn’t show their math on this, but I doubt this point carries their objection. If I ask the person who selected pi how theta was selected, and they say “oh, it was selected in response to pi to cram as much important data as possible into places that are extraordinarily difficult for spies to enter,” then I’m willing to buy that after updating (which I will do) I now have a distribution over theta that’s independent of pi. But this new distribution will be one where I’ll eventually converge to the right answer on this particular pi!
    
    So yeah, if I’m about to start playing the treasure hunting game, and then somebody informs me that theta was actually chosen adversarially after pi was chosen, I’m definitely going to need to update on pi. Which means that if we add an adversary to the game, my prior must depend on pi. Call it forced if you will; but it seems correct to me that if you tell me the game might be adversarial (thus justifying your frequentist demand) then I will expect theta to sometimes be dependent on pi (in the most inconvenient possible way).
    - IlyaShpitser 24 Oct 2015 17:17 UTC
      5 points
      Parent
      
      You made a very frequentist demand.
      
      I don’t think this is right. In the R/W example they are interested in some number. Statisticians are always interested in some number or other! A frequentist will put an interval around this number with some properties. A Bayesian will try to construct a setup where the posterior ends up concentrating around this number. The point is, it takes a Bayesian (who ignores relevant info) forever to get there, while it does not take the frequentist forever. It is not a frequentist demand that you get to the right answer in a reasonable number of samples, that’s a standard demand we place on statisticial inference!
      
      What’s going wrong here for Bayesians is they are either ignoring information (which is always silly), or doing an extremely unnatural setup to not ignore information. Frequentists are quite content to exploit information outside the likelihood, Bayesians are forbidden from doing so by their framework (except in the prior of course).
      
      Ah, and now I think I see what’s going on!
      
      I don’t think this example is adversarial (in the sense of somewhat artificial constructions people do to screw up a particular algorithm). This is a very natural problem that comes up constantly. You don’t have to carefully pick your assignment probability to screw up the Bayesian, either, almost any such probability would work in this example (unless it’s an independent coin flip, then R/W point out Bayesians have a good solution).
      
      In fact, I could give you an infinite family of such examples, if you wanted, by just converting causal inference problems into the R/W setup where lots of info lives outside the likelihood function.
      
      You can’t really say “oh I believe in the likelihood principle,” and then rule out examples where the principle fails as unnatural or adversarial. Maybe the principle isn’t so good.
      
      I don’t understand at all this business with “logical omniscience” and how it’s supposed to save you.
      - So8res 24 Oct 2015 18:19 UTC
        9 points
        Parent
        If the Bayesian’s ignoring information, then you gave them the wrong prior. As far as I can tell, the objection is that the prior over theta which doesn’t ignore the information depends on pi, and intuitions say that Bayesians should think that pi should be independent from theta. But if theta can be chosen in response to pi, then the Bayesian prior over theta had better depend on pi.
        
        I wasn’t saying that this problem is “adversarial” in the “you’re punishing Bayesians therefore I don’t have to win” way; I agree that that would be a completely invalid argument. I was saying “if you want me to succeed even when theta is chosen by someone who doesn’t like me after pi is chosen, I need a prior over theta which depends on pi.” Then everything works out, except that Robins and Wasserman complain that this is torturing Bayesiansim to give a frequentist answer. To that, I shrug. You want me to get the frequentist result (“no matter which theta you pick I converge”) then the result will look frequentist. Not much surprise there.
        
        This is a very natural problem that comes up constantly.
        
        You realize that the Bayesian gets the right answer way faster than the frequentist in situations where theta is discrete, or sufficiently smooth, or parametric, right? I doubt you find problems like this where theta is non-parametric and utterly discontinuous “naturally” or “constantly”. But even if you do, the Bayesian will still succeed with a prior over theta that is independent of pi, except when the pi is so complicated and theta that is so discontinuous and so precisely tailored to hiding information in places that pi makes it very very difficult to observe that the only way you can learn theta is by knowing that it’s been tailored to that particular pi. (The frequentist is essentially always assuming that theta is tailored to pi in this way, because they’re essentially acting like theta might have been selected by an adversary, because that’s what you do if you want to converge in all cases.) And even in that case the Bayesian can succeed by putting a prior on theta that depends on pi. What’s the problem?
        
        Imagine there’s a game where the two of us will both toss an infinite number of uncorrelated fair coins, and then check which real numbers are encoded by these infinite bit sequences. Using any sane prior, I’ll assign measure zero to the event “we got the same real number.” If you’re then like “Aha! But what if my coin actually always returns the same result as yours?” then I’m going to shrug and use a prior which assigns some non-zero probability to a correlation between our coins.
        
        Robins and Wasserman’s game is similar. We’re imagining a non-parametric theta that’s very difficult to learn about, which is like the first infinite coin sequence (and their example does require that it encode infinite information). Then we also imagine that there’s some function pi which makes certain places easier or harder to learn about, which is like the second coin sequence. Robins and Wasserman claim, roughly, that for some finite set of observations and sufficiently complicated pi, a reasonable Bayesian will place ~zero probability on theta just happening to hide all its terrible discontinuities in that pi in just such a way that the only way you can learn theta is by knowing that it is one of the thetas that hides its information in that particular pi; this would be like the coin sequences coinciding. Fine, I agree that under sane priors and for sufficiently complex functions pi, that event has measure zero—if theta is as unstructured as you say, it would take an infinite confluence of coincident events to make it one of the thetas that happens to hide all its important information precisely such that this particular pi makes it impossible to learn.
        
        If you then say “Aha! Now I’m going to score you by your performance against precisely those thetas that hide in that pi!” then I’m going to shrug and require a prior which assigns some non-zero probability to theta being one of the thetas that hides its info in pi.
        
        That normally wouldn’t require any surgery to the intuitive prior (I place positive but small probability on any finite pair of sequences of coin tosses being identical), but if we’re assuming that it actually takes an infinite confluence of coincident events for theta to hide its info in pi and you still want to measure me against thetas that do this, then yeah, I’m going to need a prior over theta that depends on pi. You can cry “that’s violating the spirit of Bayes” all you want, but it still works.
        
        And in the real world, I do want a prior which can eventually say “huh, our supposedly independent coins have come up the same way 2^trillion times, I wonder if they’re actually correlated?” or which can eventually say “huh, this theta sure seems to be hiding lots of very important information in the places that pi makes it super hard to observe, I wonder if they’re actually correlated?” so I’m quite happy to assign some (possibly very tiny) non-zero prior probability on a correlation between the two of them. Overall, I don’t find this problem perturbing.
        
        You can’t really say “oh I believe in the likelihood principle,” and then rule out examples where the principle fails as unnatural or adversarial.
        
        I agree completely!
        IlyaShpitser 24 Oct 2015 18:47 UTC
        2 points
        Parent
        
        but it still works
        
        Sure, as long as you shrug and do what works, we have nothing to discuss :).
        
        I do agree that the insight that makes this go through is basically Frequentist, regardless of setup. All the magic happened in the prior before you started.