This is a fantastic post! Well done.
That said, I have quibbles that relate to the philosophical import ascribed to the beta distribution:
- the beta distribution is an excellent exemplar of the notion of the comparative weight of evidence in the prior vs. the data, but the notion is much more general (see the identity below this list);
- priors should ideally reflect the actual information at one’s disposal, and thus should rarely actually be conjugate;
- it’s controversial to claim that alpha = beta = 1 expresses no prior knowledge; other proposals include the improper alpha = beta = 0 and Jeffreys’ prior, alpha = beta = 0.5.
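To make the first bullet concrete (this is just the standard Beta–Bernoulli identity, nothing specific to the post): with a Beta(alpha, beta) prior and s successes in n trials, the posterior mean is a weighted average of the prior mean and the sample frequency, with weights alpha + beta and n respectively:

$$\mathbb{E}[\theta \mid \text{data}] \;=\; \frac{\alpha + s}{\alpha + \beta + n} \;=\; \frac{\alpha + \beta}{\alpha + \beta + n}\cdot\frac{\alpha}{\alpha + \beta} \;+\; \frac{n}{\alpha + \beta + n}\cdot\frac{s}{n}.$$

The prior behaves like alpha + beta pseudo-observations, which is the weight-of-evidence reading; the same trade-off appears for any exponential-family likelihood paired with its conjugate prior, which is why the notion is more general than the beta distribution.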
And one other complaint: using the notion of picking a “best” value of theta for prediction to motivate the subsequent discussion was a misstep. If prediction is the goal, then the Bayesian procedure is to formulate the joint distribution of theta and the as-yet-unobserved data and then treat theta as a nuisance parameter and integrate over it.
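In symbols (a standard formulation, included only to spell out what integrating over theta means): the posterior predictive distribution is

$$p(\tilde{x} \mid \text{data}) \;=\; \int p(\tilde{x} \mid \theta)\, p(\theta \mid \text{data})\, d\theta,$$

and for the Beta–Bernoulli case this gives the probability of a success on the next trial as (alpha + s)/(alpha + beta + n), which with alpha = beta = 1 is Laplace’s rule of succession, (s + 1)/(n + 2). No single “best” value of theta is ever selected.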
In spite of the above criticisms, I consider this post yeoman’s work—it deserves more upvotes than I can give it.
Thank you very much for the compliments, and for the honest criticism!
I am still thinking about your comment, and I intend to write a detailed response once I have considered your criticisms more thoroughly. In the meantime, though, I wanted to say that the feedback is very much appreciated!
After rereading this, I agree with you that I emphasized the beta distribution too heavily. This wasn’t my intention; I just picked it because it was the simplest conjugate prior I could find. In the next draft of this document, I’ll make sure to stress that the beta distribution is just one of many great conjugate priors!
I am a bit confused about what the second point means. Do you mean that conjugate priors are insufficient for capturing the actual prior knowledge possessed?
I did not know that it was controversial to claim that alpha = beta = 1 expresses no prior knowledge! I think I still prefer alpha = beta = 1 to the other choices, since the uniform distribution has the highest entropy of any continuous distribution over [0,1]. What are the benefits of the other two proposals?
Your last complaint is something I was worried about when I wrote this. Part of why I wrote it that way was that I figured people would be more familiar with the MLE/MAP style of prediction. Thanks to your feedback, though, I think I’ll change that in my next draft of this document.
Again, thank you so much for the detailed criticism; it is very much appreciated! =)
The improper alpha = beta = 0 prior, sometimes known as Haldane’s prior, is derived using an invariance argument in Jaynes’s 1968 paper Prior Probabilities. I actually don’t trust that argument—I find the critiques of it here compelling.
Jeffreys priors are derived from a different invariance argument; Wikipedia has a pretty good article on the subject.
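For concreteness, both of these can be written as (limiting) members of the beta family, with densities, up to normalization,

$$\pi_{\text{Haldane}}(\theta) \propto \theta^{-1}(1-\theta)^{-1}, \qquad \pi_{\text{Jeffreys}}(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2},$$

i.e. Beta(0, 0), which is improper because the integral diverges at 0 and 1, and Beta(1/2, 1/2). The Jeffreys prior comes from the Fisher information of the binomial likelihood, $\pi(\theta) \propto \sqrt{I(\theta)}$ with $I(\theta) \propto 1/(\theta(1-\theta))$.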
I have mostly used the uniform prior myself in the past, although I think in the future I’ll be using the Jeffreys prior as a default for the binomial likelihood. But the maximum entropy argument for the uniform prior is flawed: differential entropy is not the correct extension of discrete Shannon entropy to continuous distributions. The correct generalization is relative entropy, which is defined with respect to a reference measure. Since the choice of that measure is left unjustified, the maximum entropy argument is missing an essential component.
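To spell that out (this is just the standard relative-entropy functional): the quantity that does generalize Shannon entropy to continuous distributions is the negative relative entropy with respect to a reference measure m,

$$H_m(p) \;=\; -\int p(\theta)\,\log\frac{p(\theta)}{m(\theta)}\,d\theta,$$

and differential entropy is the special case where m is Lebesgue measure on [0, 1]. Maximizing $H_m$ returns m itself, so the claim that the uniform prior is maximum entropy presupposes Lebesgue measure as the reference, and justifying that presupposition is precisely the step the argument leaves out.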