After rereading this, I agree with you that I emphasized the beta distribution too heavily. This wasn’t my intention; I just picked it because it was the simplest conjugate prior I could find. In the next draft of this document, I’ll make sure to stress that the beta distribution is just one of many great conjugate priors!
I am a bit confused about what the second point means. Do you mean that conjugate priors are insufficient for capturing the prior knowledge one actually possesses?
I did not know that it was controversial to claim that alpha = beta = 1 expresses no prior knowledge! I think I still prefer alpha = beta = 1 to the other choices, since the uniform distribution has the highest entropy of any continuous distribution over [0,1]. What are the benefits of the other two proposals?
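(For what it’s worth, the calculation I have in mind is the standard one: for any density p on [0,1], the differential entropy satisfies

h(p) = -\int_0^1 p(\theta) \log p(\theta) \, d\theta = -D_{KL}(p \,\|\, \mathrm{Uniform}[0,1]) \le 0,

with equality exactly when p is the uniform density, so the uniform distribution does maximize differential entropy on [0,1].)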
Your last complaint is something I was worried about when I wrote this. Part of the reason I wrote it that way was that I figured readers would be more familiar with the MLE/MAP style of prediction. Thanks to your feedback, though, I think I’ll change that in my next draft of this document.
Again, thank you so much for the detailed criticism; it is very much appreciated! =)
The improper alpha = beta = 0 prior, sometimes known as Haldane’s prior, is derived using an invariance argument in Jaynes’s 1968 paper Prior Probabilities. I actually don’t trust that argument—I find the critiques of it here compelling.
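To make the object concrete: Haldane’s prior is the \alpha, \beta \to 0 limit of the beta family,

\pi(\theta) \propto \theta^{-1}(1-\theta)^{-1},

and since \int_0^1 \theta^{-1}(1-\theta)^{-1} \, d\theta diverges at both endpoints, the prior cannot be normalized, which is why it is improper.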
Jeffreys priors are derived from a different invariance argument; Wikipedia has a pretty good article on the subject.
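For the binomial likelihood specifically, the recipe is to take the square root of the Fisher information. A single Bernoulli trial has I(\theta) = 1/(\theta(1-\theta)), so the Jeffreys prior is

\pi_J(\theta) \propto \sqrt{I(\theta)} = \theta^{-1/2}(1-\theta)^{-1/2},

which is just the Beta(1/2, 1/2) distribution.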
I have mostly used the uniform prior myself in the past, although I think in the future I’ll be using the Jeffreys prior as a default for the binomial likelihood. But the maximum entropy argument for the uniform prior is flawed: differential entropy is not an extension of discrete Shannon entropy to continuous distributions. The correct generalization is relative entropy, which is defined with respect to a reference measure; since the choice of that measure is arbitrary, the maximum entropy argument is missing an essential component.
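To spell that out a little, the generalization I mean is the relative entropy of p with respect to a reference measure m,

H[p; m] = -\int p(\theta) \log \frac{p(\theta)}{m(\theta)} \, d\theta.

Differential entropy is what you recover when m is taken to be Lebesgue measure on [0,1], so “the uniform prior maximizes entropy” is only true once you have already singled out the uniform measure as the reference; that choice of m is the missing component.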