So just to be clear: there are two things, the prior probability, which is the value P(H|I), and the background information, which is I itself. So P(H|D,I_1) can differ from P(H|D,I_2) because they are updates on the same data and the same hypothesis but with different background information; they are both, however, posterior probabilities. And the priors P(H|I_1) and P(H|I_2) may be equal even if I_1 and I_2 are radically different and produce updates in opposite directions given the same data. P(H|I) is still called the prior probability, but it is something very different from the background information, which is essentially just I.
Is this right? Let me be more specific.
Let’s say my prior information is case1; then P(second ball is R | first ball is R & case1) = 4⁄9.
If my prior information were case2, then P(second ball is R | first ball is R & case2) = 2⁄3 [by the rule of succession],
and P(first ball is R | case1) = 50% = P(first ball is R | case2).
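(As a quick sanity check of these numbers, here is a sketch assuming case1 means drawing without replacement from an urn known to contain 5 red and 5 blue balls, and case2 means Laplace’s rule of succession with a uniform prior over the urn’s red fraction; both setups are my guesses at what the cases stand for.)

```python
from fractions import Fraction

# Case 1 (assumed): urn with 5 red and 5 blue balls, drawn without replacement.
p_first_red_c1 = Fraction(5, 10)
# After one red is removed, 4 of the remaining 9 balls are red.
p_second_red_given_first_c1 = Fraction(4, 9)

# Case 2 (assumed): Laplace's rule of succession, (k + 1) / (n + 2)
# after observing k red balls in n draws.
def rule_of_succession(k, n):
    return Fraction(k + 1, n + 2)

p_first_red_c2 = rule_of_succession(0, 0)               # 1/2
p_second_red_given_first_c2 = rule_of_succession(1, 1)  # 2/3

print(p_first_red_c1, p_first_red_c2)  # equal prior probabilities: 1/2 1/2
print(p_second_red_given_first_c1, p_second_red_given_first_c2)  # different updates: 4/9 2/3
```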
This is why different prior information can make you learn in different directions, even if two bodies of prior information produce the same prior probability?
Please let me know if I am making any sort of mistake, or if I got it right; either way.
You got it right. The three different cases correspond to different joint distributions over sequences of outcomes. Prior information that one of the cases obtains amounts to picking one of these distributions (of course, one can also have weighted combinations of these distributions if there is uncertainty about which case obtains). It turns out that in this example, if you add together the probabilities of all the sequences that have a red ball in the second position, you will get 0.5 for each of the three distributions. So equal prior probabilities. But even though the terms sum to 0.5 in all three cases, the individual terms will not be the same. For instance, prior information of case 1 would assign a different probability to RRRRR (0.004) than prior information of case 2 (0.031).
So the prior information is a joint distribution over sequences of outcomes, while the prior probability of the hypothesis is (in this example at least) a marginal distribution calculated from this joint distribution. Since multiple joint distributions can give you the same marginal distribution for some random variable, different prior information can correspond to the same prior probability.
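(A minimal illustration of this last point, using two made-up joint distributions over a pair of binary draws that share the same marginal for the second draw but disagree about the conditional:)

```python
from fractions import Fraction

F = Fraction
# Two joint distributions over (first_is_red, second_is_red), each summing to 1.
joint_a = {(1, 1): F(1, 4), (1, 0): F(1, 4), (0, 1): F(1, 4), (0, 0): F(1, 4)}   # independent draws
joint_b = {(1, 1): F(2, 5), (1, 0): F(1, 10), (0, 1): F(1, 10), (0, 0): F(2, 5)}  # correlated draws

def marginal_second_red(joint):
    return sum(p for (first, second), p in joint.items() if second == 1)

def second_red_given_first_red(joint):
    restricted = {k: p for k, p in joint.items() if k[0] == 1}
    total = sum(restricted.values())
    return sum(p for (first, second), p in restricted.items() if second == 1) / total

# Same marginal for "second is red"...
print(marginal_second_red(joint_a), marginal_second_red(joint_b))  # 1/2 1/2
# ...but different conditionals, hence different "learning" from the first draw.
print(second_red_given_first_red(joint_a), second_red_given_first_red(joint_b))  # 1/2 4/5
```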
When you restrict attention to those sequences that have a red ball in the first position, and now add together the (appropriately renormalized) joint probabilities of sequences with a red ball in the second position, you don’t get the same number with all three distributions. This corresponds to the fact that the three distributions are associated with different learning rules.
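(A sketch of that restrict-and-renormalize calculation, again assuming case1 is sampling without replacement from an urn with 5 red and 5 blue balls: enumerate all length-5 sequences, sum the joint probabilities for the marginal, then condition on a red first ball.)

```python
from fractions import Fraction
from itertools import product

def urn_sequence_prob(seq, red=5, blue=5):
    """Probability of a draw sequence when drawing without replacement."""
    p = Fraction(1)
    for ball in seq:
        total = red + blue
        if ball == "R":
            p *= Fraction(red, total)
            red -= 1
        else:
            p *= Fraction(blue, total)
            blue -= 1
    return p

# Joint distribution over all length-5 sequences.
joint = {"".join(s): urn_sequence_prob(s) for s in product("RB", repeat=5)}

# Marginal: P(second ball is R), summing over every sequence with R in position 2.
marg = sum(p for s, p in joint.items() if s[1] == "R")

# Conditional: restrict to sequences with R in position 1, then renormalize.
first_red = {s: p for s, p in joint.items() if s[0] == "R"}
cond = sum(p for s, p in first_red.items() if s[1] == "R") / sum(first_red.values())

print(marg)                   # 1/2
print(cond)                   # 4/9
print(float(joint["RRRRR"]))  # ~0.004
```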
No really, I really want help. Please help me understand if I am confused, and settle my anxiety if I am not confused.