I think the answer is that PT:tLoS is not a textbook. It is part of a conversation amongst academics about the fundamentals of statistics. It gets mistaken for a textbook because it appears to be a description of the basics, but it actually participates in an argument about what the basics ought to be, and so includes the author's statement of what he thinks the basics are.
Writing a textbook on Bayesian statistics is an important challenge, but you could not possibly follow the plan of PT:tLoS. Not only is the mathematics of chapter 2, proving Cox’s theorem, too hard for first-year undergraduates, but its perspective, of deducing the rules from general considerations, is too sophisticated. It cannot possibly precede the elementary sampling theory of chapter 3 in an undergraduate curriculum.
Not to mention the fact that he vastly overstates the impact of Cox’s theorem and never honestly writes down the assumptions for the version of it he proves.
What is your view as to the appropriate place for Cox’s theorem in the collection of justifications for the Bayesian approach?
My view is that writing a textbook on Bayesian statistics is very difficult because it is hard to order the material in a satisfactory way.
Here is why I’m wrong: when we teach calculus we teach differentiation. Then we say that integrals are important because they are areas under curves. Then we drill our pupils in the computation of integrals by finding anti-derivatives. It is only two or three years later that we introduce the Riemann integral in order to have a rigorous definition of the area under a curve that can be used to formalise the statement of the fundamental theorem of calculus.
It seems natural to proceed in the same spirit: teach Bayesian updating, and drill our students in these methods of calculation. The fact that there is only one updating rule that really works is mentioned but not proved. Those who follow the maths track get to see Cox’s theorem two or three years later.
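To make “drill our students in these methods of calculation” concrete, the sort of exercise I have in mind is the standard base-rate problem (my own toy numbers, not an example from the book): a test with 90% sensitivity and a 5% false-positive rate, applied where the prior probability of the condition is 1%.

$$P(H\mid E) = \frac{P(E\mid H)\,P(H)}{P(E\mid H)\,P(H) + P(E\mid \neg H)\,P(\neg H)} = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.05 \times 0.99} \approx 0.15$$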
I’m not asking about teaching or textbook-writing; my question is a tangent to that discussion. Smoofra invoked a notion of “impact”; I’m trying to determine how smoofra rates Cox’s theorem on that scale.
I know of a number of paths to the Bayesian approach:
Dutch book arguments (coherence of bets; a toy numeric sketch follows this list)
more elaborate decision theory arguments in the same vein (coherence of decisions under uncertainty)
the complete class theorem (every decision rule that is “admissible” (non-dominated) in a frequentist sense is a Bayes decision rule, or a limit of Bayes decision rules)
de Finetti’s theorem (exchangeability of observables implies the existence of a prior and posterior for a parameter as a mathematical fact)
Cox’s theorem (a representation of plausibility as a single real number consistent with Boolean logic must be isomorphic to probability)
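Here is the kind of toy calculation behind the first item (my own illustrative numbers, nothing from Cox or Jaynes): an agent whose ticket prices for A and not-A sum to more than 1 can be sold a pair of bets that loses money however A turns out.

```python
# Hypothetical toy Dutch book: the agent's prices violate P(A) + P(not A) = 1.
p_A = 0.7      # price the agent will pay for a ticket worth 1 if A is true
p_not_A = 0.5  # price the agent will pay for a ticket worth 1 if A is false
               # coherence would require the prices to sum to 1; here they sum to 1.2

for a_is_true in (True, False):
    payout_to_agent = (1.0 if a_is_true else 0.0) + (0.0 if a_is_true else 1.0)
    premiums_paid = p_A + p_not_A          # paid up front for the two tickets
    net_for_agent = payout_to_agent - premiums_paid
    print(f"A={a_is_true}: agent nets {net_for_agent:+.2f}")   # -0.20 in both cases
```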
My question is about which justifications smoofra knows (in particular, have I missed any?) and what impact each of them has.
FWIW, in high school my very first introduction to integrals used the high-school version of Riemann integration with a simple definite integral that was solved with algebra and the notion of the limit of a sequence. Once it was demonstrated that the area under the curve was the anti-derivative in that case, we got the statement of the Fundamental Theorem of Calculus and drilling in anti-derivatives and integration by parts etc.
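The kind of computation I mean (my own reconstruction, not the exact classroom example): for $\int_0^1 x^2\,dx$, the right-endpoint Riemann sums can be evaluated with nothing more than the formula for $\sum k^2$ and the limit of a sequence,

$$\sum_{k=1}^{n} \frac{1}{n}\left(\frac{k}{n}\right)^2 = \frac{n(n+1)(2n+1)}{6n^3} \longrightarrow \frac{1}{3} \quad (n \to \infty),$$

which agrees with the antiderivative evaluation $\left[x^3/3\right]_0^1 = 1/3$.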
I think it’s an important theorem, but if you want to talk about it you need to say what the theorem actually says in math, not try to badly paraphrase it in English and then claim it’s all the justification you ever need for the Bayesian approach.
The truth is that when you rigorously state the assumptions, they’re actually pretty strong, and this fact is dodged and evaded and ignored throughout Jaynes’s treatment of the subject.
Differentiability is strong in a mathematical sense, but I’m not sure I want a system of reasoning about plausibility that doesn’t vary smoothly with smoothly varying evidence. I guess the answer is to actually look at such systems, but I don’t have the chops to follow Halpern (or this paper that claims to prove the theorem under very weak assumptions that exclude Halpern’s counterexample).
See my reply to AlanCrowe for a more precise statement of what I was asking.
It’s not just differentiability. Why use real numbers at all? Why does P(A&B|C) have to be a function of P(A|C) and P(B|A&C)? Jaynes tries to prevent the reader from even thinking about these questions. I’m not arguing against his conclusion, but his argument is incomplete and inadequate, and he tries to cover it up.
This paper formally states all of the assumptions necessary in the proof of Cox’s theorem (R1-R5 in the paper) and notes where the controversies are before going on with the proof. R5 is obviously not well supported and the major dispute over R1 is whether plausibilities must be universally comparable. (R1 and R5 correspond to your two major objections above, in order).
As requested below, a top level post would be very interesting.
Thanks! I haven’t seen that one before.
I’m working on a post on this topic, but I don’t think I can really adequately address what I don’t like about how Jaynes presents the foundations of probability theory without presenting it myself the way I think it ought to be. And to do that I need to actually learn some things I don’t know yet, so it’s going to be a bit of a project.
In section 1.7, ‘The basic desiderata’, the decision to use real numbers is emphasised as one of three basic desiderata and tagged as equation 1.28. Jaynes devotes section 1.8, ‘Comments’, to chewing over this point for a little more than a page before punting the issue to Appendix A. He writes:
These remarks are interjected to point out that there is a large unexplored area of possible generalizations and extensions of the theory to be developed here; perhaps this may inspire others to try their hand at developing ‘multidimensional theories’ of mental activity, which would more and more resemble the behaviour of actual human brains—not all of which is undesirable. Such a theory, if successful, might have an importance beyond our present ability to imagine.
Perhaps Jaynes is trying here to prevent the reader from even thinking about these questions, but if so his strategy is more bold and unconventional than I can fathom.
As for P(AB|C) = F[P(A|C), P(B|A&C)], that is equation 2.1. Jaynes considers an alternative in equation 2.2 and then discusses how to organize an exhaustive case split, before referring the reader interested in “Carrying out this somewhat tedious analysis” to Tribus.
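For readers without the book to hand, the consistency argument that equation 2.1 feeds into runs roughly as follows (my paraphrase of the standard derivation, not a quotation). Writing $x = P(A|C)$, $y = P(B|A\&C)$, $z = P(D|A\&B\&C)$ and decomposing the conjunction $ABD$ two ways,

$$P(ABD|C) = F[F[x, y], z] = F[x, F[y, z]],$$

an associativity equation whose solutions (under the regularity assumptions that are precisely the point in dispute) take the form $F[x, y] = w^{-1}(w(x)\,w(y))$ for some monotone $w$, which is the product rule after regrading plausibilities by $w$.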
I confess that I have not worked through the 11 cases that Jaynes says need to be checked.
Notice though that in graduate-level texts, dumping shit on the reader like this is standard practice. Jaynes is unusually helpful and complete for a text at this level. Compare it for example to Categories for the Working Mathematician. I like CftWM. MacLane takes pains to organise his material and to direct the reader’s attention to points requiring special care. Yet, following the conventions of the genre, he ends page 9 with:
More explicitly, given a metacategory of objects and arrows, its arrows, with the given composition, satisfy the “arrows-only” axioms; conversely, an arrows-only metacategory satisfies the objects-and-arrows axioms when the identity arrows, defined as above, are taken as the objects (Proof as exercise).
Yes, Jaynes’s argument is incomplete, but by being more complete than is customary, even compared to works that are admired for their thoroughness and clarity, Jaynes has bloated his book to 727 pages. Criticising his omission of tedious case analysis is unfair.
Perhaps Jaynes is trying here to prevent the reader from even thinking about these questions, but if so his strategy is more bold and unconventional than I can fathom.
His strategy is to make them look like trivial details, things that can be safely assumed, things that only a pedantic mathematician could care about, things that don’t matter.
As for P(AB|C) = F[P(A|C), P(B|A&C)], that is equation 2.1. Jaynes considers an alternative in equation 2.2 and then discusses how to organize an exhaustive case split....
This part, in particular, is what struck me as the most absolutely, monumentally awful part of the book. The other cases Jaynes considers in his “exhaustive case split” are only a tiny, minuscule, arbitrary set of the things that P(AB|C) might depend on. Why should P(AB|C) not depend on the specific structure of the propositions themselves?
What bothers me so much about this part of the book isn’t so much that the argument is incomplete, but that Jaynes is downright deceptive in his attempts to convince the reader that it is a complete rigorous justification for the Bayesian approach. Jaynes (and Eliezer) make it sound like Cox proved a generic Dutch book argument against anyone who doesn’t use the Bayesian approach. There may indeed be such a theorem, but Cox’s theorem just isn’t it.
I’d like to see this discussed as a top level post. Care to take a stab at it, Smoofra?
The other cases Jaynes considers in his “exhaustive case split” are only a tiny, minuscule, arbitrary set of the things that P(AB|C) might depend on.
That’s a good point. I suspect that the oversight is due to the fact that the truth value of a conjunction of propositions depends only on the truth values of the constituent propositions, and not on any other structure they might have. I conjecture that the desideratum that propositions with the same truth value have the same plausibility could be used to demonstrate that P(AB|C) is not a function of any additional structure of the propositions, but Jaynes does not highlight the issue or perform any such demonstration.
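Spelling the conjecture out slightly (my own formalisation, not anything in the book): the desideratum is that logically equivalent propositions get equal plausibility, i.e. if $C \vdash (A \leftrightarrow A')$ then $P(A|C) = P(A'|C)$. If $A \leftrightarrow A'$ and $B \leftrightarrow B'$ given $C$, then $AB \leftrightarrow A'B'$ given $C$, so

$$P(AB|C) = P(A'B'|C),$$

which would at least show that $P(AB|C)$ depends on $A$ and $B$ only through their logical content given $C$, not on their syntactic structure; it would not by itself show that the dependence factors through $P(A|C)$ and $P(B|A\&C)$.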