This is an excellent article. However, I did have the same philosophical problem that Cyan raised in this bullet point:
priors should ideally reflect the actual information at one’s disposal, and thus should rarely actually be conjugate;
You seem to suggest that conjugate prior distributions are “smart” because they update in a computationally tractable way. Certainly, as a concession to practical necessity, we have to take computational tractability into account. But it is controversial to treat that concession as part of the ideal epistemology that we are trying to approximate.
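For concreteness, here is the computational point as I understand it, using the standard Beta–Bernoulli pairing as my own illustration rather than an example from the article: if the prior over \beta is \mathrm{Beta}(a, b) and we observe a single Bernoulli outcome x \in \{0, 1\}, the posterior is \mathrm{Beta}(a + x, b + 1 - x). The update reduces to incrementing a count, with no intractable normalizing integral. My objection is not to that math, only to treating this convenience as part of ideal epistemology.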
Also, I found myself confused at a few points near the beginning. You write:
While going about your daily activities, you observe an event of type x. Because you’re a good Bayesian, you have some internal parameter \beta which represents your belief that x will occur.
Now, you’re familiar with the Ways of Bayes, and therefore you know that your beliefs must be updated with every new datapoint you perceive. Your observation of x is a datapoint, and thus you’ll want to modify \beta. But how much should this datapoint influence \beta?
At first, I misread you as saying, in effect, “Given that x occurs, what should be your updated probability that x occurs?” But, of course, your updated probability, conditioned on x’s occurring, that x occurs, should be 1.
I also misunderstood you to be proposing to consider the probability of the probability of a given event being such-and-such. That is, I thought that you were proposing to consider a probability of the form P(P(x | y) = p | z), where x, y, and z are events, and p is a number in [0,1]. But, as I understand it, this is not a well-formed notion in Bayesian epistemology.
I think that my confusion arose from your calling \beta an “internal parameter”. But, from the subsequent discussion, it seems better to think of \beta as an unknown parameter fed into whatever physical process generated x. For example, \beta could be an unknown parameter fed into a pseudo-random number generator that was observed to output the number x.
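To make the reading I eventually settled on explicit, here is a minimal sketch (all names are mine, not the article's): \beta is an unknown parameter of the process that generates the observations, and what the observer carries around is a posterior distribution over \beta, updated conjugately after each observation.

```python
import random

random.seed(0)

# Parameter of the physical process generating x; unknown to the observer.
true_beta = 0.3
# One observation of "an event of type x" (True/False).
x = random.random() < true_beta

# Conjugate Beta prior over beta; Beta(1, 1) is uniform on [0, 1].
a, b = 1.0, 1.0

# A single Bernoulli observation updates the Beta prior by incrementing a count.
a += int(x)
b += 1 - int(x)

# The observer's updated belief that an event of type x will occur.
posterior_mean = a / (a + b)
print(f"observed x={x}, posterior mean of beta = {posterior_mean:.3f}")
```

On that reading, the quantity being updated is not "the probability that x occurs, given that x occurred" (which is trivially 1), but the posterior over the unknown parameter \beta of the generating process.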