alex_zag_al comments on A Fervent Defense of Frequentist Statistics

alex_zag_al 10 Feb 2014 8:17 UTC
7 points
On the other hand, an argument I hear is that Bayesian methods make their assumptions explicit because they have an explicit prior. If I were to write this as an assumption and guarantee, I would write:

Assumption: The data were generated from the prior.

Guarantee: I will perform at least as well as any other method.

While I agree that this is an assumption and guarantee of Bayesian methods, there are two problems that I have with drawing the conclusion that “Bayesian methods make their assumptions explicit”. The first is that it can often be very difficult to understand how a prior behaves; so while we could say “The data were generated from the prior” is an explicit assumption, it may be unclear what exactly that assumption entails. However, a bigger issue is that “The data were generated from the prior” is an assumption that very rarely holds; indeed, in many cases the underlying process is deterministic (if you’re a subjective Bayesian then this isn’t necessarily a problem, but it does certainly mean that the assumption given above doesn’t hold).

In a post addressing a crowd where a lot of people read Jaynes, you’re not addressing the Jaynesian perspective on where priors come from.

When E. T. Jaynes does statistics, the assumptions are made very clear.

The Jaynesian approach is this:
- your prior knowledge puts constraints on the prior distribution,
- and the prior distribution is the distribution of maximum entropy that meets all those constraints.
The assumptions are explicit because the constraints are explicit.
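
To make "maximum entropy subject to constraints" concrete, here is a sketch of the standard formulation. The symbols f_i, c_i, lambda_i and Z are notation introduced for illustration, not taken from Jaynes or from the post, and the closed form assumes a maximizer exists and Z is finite:

```latex
\[
\max_{p}\; H[p] = -\int p(x)\,\ln p(x)\,dx
\quad\text{subject to}\quad
\int p(x)\,dx = 1,\qquad
\int f_i(x)\,p(x)\,dx = c_i \quad (i = 1,\dots,k).
\]
\[
p(x) = \frac{1}{Z(\lambda)}\exp\!\Big(-\sum_{i=1}^{k}\lambda_i f_i(x)\Big),
\qquad
Z(\lambda) = \int \exp\!\Big(-\sum_{i=1}^{k}\lambda_i f_i(x)\Big)\,dx .
\]
```

Each constraint function f_i is one piece of stated prior knowledge, and the multipliers lambda_i are chosen so that the constraints hold. Nothing else enters the form of p.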

As an example, imagine that you’re measuring something you’ve measured before. In a lot of situations, you’ll end up with constraints on the mean and variance of your prior distribution. This is because you reason in the following way:
- You think that your best estimate for the next measurement is the average of your previous measurements.
- Your previous estimates have given you a sense of the general scale of the errors you get. You express this as an estimate of the squared difference between your estimated next measurement and your real next measurement.
If we take “best estimate” to mean “minimum expected squared error”, then these are constraints on the mean and variance of the prior distribution. If you maximize entropy subject to these constraints, you get a normal distribution. And that’s your prior.
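
As a rough numerical check of that last claim, the sketch below compares the differential entropy of three distributions that share the same variance, using their textbook closed-form entropies. The choice of Laplace and uniform as comparison families and the value of `sigma` are assumptions made for illustration; this is a spot check, not a proof that the normal maximizes entropy over all densities.

```python
import math


def normal_entropy(sigma):
    """Differential entropy (nats) of a normal with standard deviation sigma."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


def laplace_entropy(sigma):
    """Differential entropy (nats) of a Laplace with standard deviation sigma."""
    b = sigma / math.sqrt(2)  # Laplace scale b has variance 2 * b**2
    return 1 + math.log(2 * b)


def uniform_entropy(sigma):
    """Differential entropy (nats) of a uniform with standard deviation sigma."""
    width = sigma * math.sqrt(12)  # uniform of this width has variance width**2 / 12
    return math.log(width)


sigma = 1.0  # hypothetical error scale, standing in for s from past measurements

entropies = {
    "normal": normal_entropy(sigma),
    "laplace": laplace_entropy(sigma),
    "uniform": uniform_entropy(sigma),
}

for name, h in entropies.items():
    print(f"{name:8s} {h:.3f} nats")
```

With `sigma = 1.0` this prints roughly 1.419, 1.347 and 1.242 nats, so the normal comes out highest of the three. Changing `sigma` shifts all three values by the same amount (ln sigma), so the ordering does not depend on the scale you pick.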

The assumptions are quite explicit here. You’ve assumed that these measurements are measuring some unchanging quantity in an unbiased way, and that the magnitude of the error tends to stay the same. And you haven’t assumed any other relationship between the next measurement and the previous ones. You definitely didn’t assume any autocorrelation, for example, because you didn’t use the previous measurement in any of your estimates/constraints.

But since we’re not assuming that the data are generated from the prior, we don’t have the corresponding guarantee. Which brings up the question, what do we have? What’s the point of this?

A paper I’m reading right now (Portilla & Simoncelli, 2000, “A Parametric Texture Model Based on Joint Statistics of Complex Wavelet Coefficients”) puts it this way:

Guarantee: “The maximum entropy density is optimal in the sense that it does not introduce any constraints on the [prior] besides those [we assumed].”

It’s not a performance guarantee. Just a guarantee against hidden assumptions, I guess?

Despite apparently having nothing to do with performance, it makes a lot of sense from within Jaynes’s perspective. To him, probability theory is logic: figuring out which conclusions are supported by your premises. It just extends deductive logic, allowing you to see to what degree each conclusion is supported by your premises.

And, I think your criticism of the Dutch Book argument applies here: is your time better spent making sure you don’t have any hidden assumptions, or writing more programs to make more money? But it definitely makes your assumptions explicit; that’s just part of the whole “probability theory as logic” paradigm.

(But I don’t really understand what it means to not introduce any constraints besides those assumed, or why maxent is the procedure that achieves this. That’s why I quoted the guarantee from someone who I assume does understand.)
- jsteinhardt 10 Feb 2014 8:31 UTC
  2 points
  
  The assumptions are explicit because the constraints are explicit.
  
  What does it mean to “assume that the prior satisfies these constraints”? As you already seem to indicate later in your comment, the notion of “a prior satisfying a constraint” is pretty nebulous. It’s unclear what concrete statement about the world this would correspond to. So I still don’t think this constitutes a particularly explicit assumption.
  
  And, I think your criticism of the Dutch Book argument applies here: is your time better spent making sure you don’t have any hidden assumptions, or writing more programs to make more money?
  
  I’m responding to arguments that others have raised in the past that Bayesian methods make assumptions explicit while frequentist methods don’t. If I then show that frequentist methods also make explicit assumptions, it seems like a weird response to say “oh, well who cares about explicit assumptions anyways?”
  - alex_zag_al 10 Feb 2014 8:38 UTC
    2 points
    
    it seems like a weird response to say “oh, well who cares about explicit assumptions anyways?”
    
    Yeah, sorry. I was getting a little off topic there. It’s just that in your post, you were able to connect the explicit assumptions being true to some kind of performance guarantee. Here I was musing on the fact that I couldn’t. It was meant to undermine my point, not to support it.
    
    What does it mean to “assume that the prior satisfies these constraints”?
    
    ?? The answer to this is so obvious that I think I’ve misunderstood you. In my example, the constraints are on moments of the prior density. In many other cases, the constraints are symmetry constraints, which are also easy to express mathematically.
    
    But then you bring up concrete statements about the world? Are you asking how you get from your prior knowledge about the world to constraints on the prior distribution?
    
    EDIT: You don’t “assume a constraint”; a constraint follows from an assumption. Can you re-ask the question?
    - jsteinhardt 10 Feb 2014 8:54 UTC
      2 points
      
      Yeah, sorry. I was getting a little off topic there. It’s just that in your post, you were able to connect the explicit assumptions being true to some kind of performance guarantee. Here I was musing on the fact that I couldn’t. It was meant to undermine my point, not to support it.
      
      Ah my bad! Now I feel silly :).
      
      But then you bring up concrete statements about the world? Are you asking how you get from your prior knowledge about the world to constraints on the prior distribution?
      
      So the prior is this thing you start with, and then you get a bunch of data and update it and get a posterior. In general it’s pretty unclear what constraints on the prior will translate to in terms of the posterior. Or at least, I spent a while musing about this and wasn’t able to make much progress. And furthermore, when I look back, even in retrospect it’s pretty unclear how I would ever test if my “assumption” held if it was a constraint on the prior. I mean sure, if there’s actually some random process generating my data, then I might be able to say something, but that seems like a pretty rare case… sorry if I’m being unclear, hopefully that was at least somewhat more clear than before. Or it’s possible that I’m just nitpicking pointlessly.
      - alex_zag_al 10 Feb 2014 9:18 UTC
        4 points
        Hmm. Considering that I was trying to come up with an example to illustrate how explicit the assumptions are, the assumptions aren’t that explicit in my example, are they?
        
        Prior knowledge about the world --> mathematical constraints --> prior probability distribution
        
        The assumptions I used to get the constraints are that the best estimate of your next measurement is the average of your previous ones, and that the best estimate of its squared deviation from that average is some number s^2, maybe the variance of your previous observations. But those aren’t states of the world, those are assumptions about your inference behavior.
        
        Then I added later that the real assumptions are that you’re making unbiased measurements of some unchanging quantity mu, and that the mechanism of your instrument is unchanging. These are facts about the world. But these are not the assumptions that I used to derive the constraints, and I don’t show how they lead to the former assumptions. In fact, I don’t think they do.
        
        Well. Let me assure you that the assumptions that lead to the constraints are supposed to be facts about the world. But I don’t see how that’s supposed to work.