Re: Preface
Is there a good reason why the Maximum Entropy method is treated as distinct from the Bayesian, rather than simply as a method for generating priors?
I think Jaynes more or less defines ‘Bayesian methods’ to be those gadgets which fall out of the Cox-Polya desiderata (i.e. probability theory as extended logic). Actually, this can’t be the whole story given the following quote on page xxiii:
“It is true that all ‘Bayesian’ calculations are included automatically as particular cases of our rules; but so are all ‘frequentist’ calculations. Nevertheless, our basic rules are broader than either of these.”
In any case, maximum entropy gives you the pre-Bayesian ensemble (I got that word from here), which then allows the Bayesian crank to turn. In particular, I think maximum-entropy methods are not Bayesian in the sense that they do not follow from the Cox-Polya desiderata.
Jaynes recommends MaxEnt for situations when “the Bayesian apparatus”, consisting of “a model, a sample space, hypothesis space, prior probabilities, sampling distribution” is not yet available, and only a sample space can be defined.
IIRC, this was my understanding of Jaynes’s position on maxent:
1. The Cox-Polya desiderata say that multiple allowed derivations of a problem ought to all lead to the same answer.
2. If we consider a list of identifiers about which we know nothing, and we ask whether the first one is more likely than the nth one, then we should answer that they are equally likely: if we said either greater than or less than, we could shuffle the list and get a contradictory answer. By induction, all members of the list are equiprobable, which forces each entry to have probability 1/n.
3. Hence, we get the Principle of Indifference. (Points 1-3 are my version of chapter 2 or 3, IIRC.)
4. Maxent is just the same idea, abstracted and applied to non-list thingies. (I haven't actually gotten this far, but it seems like the obvious next step.)
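The shuffle argument in point 2 can be checked numerically: under the symmetric group (every relabeling of the list), only the uniform assignment survives. A minimal sketch (the function name is mine, not Jaynes's):

```python
from itertools import permutations

def invariant_under_all_relabelings(p, tol=1e-12):
    """True iff the distribution is unchanged by every permutation of its entries."""
    return all(
        abs(p[i] - p[perm[i]]) < tol
        for perm in permutations(range(len(p)))
        for i in range(len(p))
    )

n = 4
print(invariant_under_all_relabelings([1.0 / n] * n))         # True: uniform survives shuffling
print(invariant_under_all_relabelings([0.4, 0.3, 0.2, 0.1]))  # False: shuffling changes the answer
```

Any non-uniform assignment is moved by some transposition, so consistency under relabeling pins down 1/n for every entry.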
The arguments seem to me to be as Bayesian as anything in his building up of Bayesian methods from the Cox-Polya criteria.
I think this is not so important, but it is helpful to think about nonetheless. I guess the first step is to define what is meant by 'Bayesian'. In my original comment, I took one necessary condition to be that a Bayesian gadget is one which follows from the Cox-Polya desiderata. It might be better to define it to be one which uses Bayes' Theorem. I think in either case, Maxent fails to meet the criteria.
Maxent produces the distribution on the sample space which maximizes entropy subject to any known constraints, which presumably come from data. If there are no constraints, one gets the principle of indifference, which, as you say, can also be gotten straight out of the Cox-Polya desiderata. But I think these are two different routes to the same target. Maxent needs something new relative to Cox-Polya, namely Shannon's information entropy. Furthermore, the derivation of Maxent is really different from the derivation of the principle of indifference from Cox-Polya.
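To make this concrete, here is a sketch of Maxent as constrained optimization, using Jaynes's dice setup (faces 1-6, mean face value constrained to 4.5); scipy is an assumed dependency, and this is a numerical illustration rather than Jaynes's analytic Lagrange-multiplier solution. Dropping the mean constraint recovers the uniform 1/6 distribution, i.e. the principle of indifference:

```python
import numpy as np
from scipy.optimize import minimize

faces = np.arange(1, 7)  # die faces 1..6

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)  # guard against log(0)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},        # normalization
    {"type": "eq", "fun": lambda p: np.dot(p, faces) - 4.5}, # constraint from "data"
]
res = minimize(neg_entropy, np.full(6, 1 / 6),
               bounds=[(0.0, 1.0)] * 6, constraints=constraints)
print(res.x)  # probabilities increase with face value (exponential-family shape)
```

The maximizing distribution tilts exponentially toward the high faces, which is exactly the extra structure that the bare Cox-Polya symmetry argument never produces.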
I could be completely off here, but I believe the principle of indifference argument is generalized by the transformation-group stuff. I think this because I can see the action of the symmetric group (the group, in the abstract-algebra sense, of all permutations) on the hypothesis space in the principle of indifference argument. Anyway, hopefully we'll get up to that chapter!
Upon further study, I disagree with myself here. It does seem like entropy as a measure of uncertainty in probability distributions more or less falls out of the Cox-Polya desiderata. I guess that 'common sense' one is pretty useful!
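As a quick sanity check on entropy as an uncertainty measure: it is largest for the uniform (maximally noncommittal) distribution and zero when the outcome is certain. A small sketch:

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in bits, with the 0*log(0) = 0 convention."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: maximal uncertainty over 4 options
print(shannon_entropy([0.7, 0.1, 0.1, 0.1]))      # lower: the distribution is more concentrated
print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))      # 0 bits: no uncertainty at all
```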