Suppose you have a suite of events A1, A2, A3 … An that correlate with event B with varying strength. You want to calculate the posterior probability of B using Bayes' theorem, and you obtain a set of numbers. How do you decide which one is the correct one? Or must you just do these calculations several times and then pick the best indicator? Sorry, I suspect there is a simple answer...
Are you trying to find the probability of B given all n events, that is, Pr[B|A1, A2, …, An]? In that case, none of the individual calculations Pr[B|A1], Pr[B|A2], …, Pr[B|An] is necessarily useful. In fact, even if each Ai individually makes B more likely, together they may make B less likely.
(For example, suppose we are rolling a fair 6-sided die, and take A1 = “We get 1 or 3”, A2 = “We get 2 or 3”, and B = “We get 1 or 2”. Then Pr[B] = 1/3 before we condition, since 2 out of 6 outcomes satisfy B. If we learn either A1 or A2, then Pr[B|A1] = Pr[B|A2] = 1/2, since 1 out of the remaining 2 outcomes satisfies B. However, if we learn both A1 and A2, then Pr[B|A1,A2] = 0, because then we know that the outcome must be 3.)
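If it helps, here is a quick sketch in Python (purely illustrative, just enumerating the six outcomes) that checks these numbers:

```python
from fractions import Fraction

outcomes = range(1, 7)  # a fair six-sided die: six equally likely outcomes

def pr(event, given=lambda x: True):
    """Conditional probability of `event` given `given`, by counting outcomes."""
    space = [x for x in outcomes if given(x)]
    return Fraction(sum(1 for x in space if event(x)), len(space))

B  = lambda x: x in (1, 2)
A1 = lambda x: x in (1, 3)
A2 = lambda x: x in (2, 3)

print(pr(B))                                    # 1/3
print(pr(B, given=A1), pr(B, given=A2))         # 1/2 1/2
print(pr(B, given=lambda x: A1(x) and A2(x)))   # 0
```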
If this is not what you mean, please elaborate.
Thank you. I want to pick the one A that best points me to B. But I apologize, I should have labeled them A (and -A), C (and -C), D (and -D)..., because they are actually different things that happen simultaneously with B. There might be (should be, even) interdependence between them (at least most of them), so I won’t use any one of them as a very reliable indicator. Just a way to quickly estimate what I expect to see.
Are you, then, trying to find which event gives you the most information about whether or not B occurred?
Yes, that’s it. It is just that some of the tests are much more expensive, to the point that I won’t be able to do them routinely, but others which are quick and easy to perform might not give me the necessary information.
The value of a test A for learning about B is measured by the mutual information I(A;B). The tradeoff between this and how easy the test is to perform is left up to you.
Here is a brief overview of the subject. As far as notation goes: I want to distinguish the test A from its outcomes, which I will denote a and -a.
The information content I(a) of an outcome is given by the formula I(a) = -log Pr[a]. (The log is often taken to be base 2, in which case the units of information are bits.) The formula is motivated by our desire that if tests A1, A2 are independent, then I(a1 and a2) = I(a1) + I(a2); the information we gain from learning both outcomes at once is the sum of the information learned from each outcome separately.
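To make the additivity concrete, here is a tiny sketch in Python (the probabilities 0.25 and 0.5 are just made-up examples of two independent outcomes):

```python
import math

def info(p):
    """Information content, in bits, of an outcome with probability p."""
    return -math.log2(p)

p_a1, p_a2 = 0.25, 0.5          # illustrative probabilities of two independent outcomes
print(info(p_a1) + info(p_a2))  # 2.0 + 1.0 = 3.0 bits
print(info(p_a1 * p_a2))        # independence: Pr[a1 and a2] = Pr[a1]Pr[a2], also 3.0 bits
```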
The entropy of a test A is the expected information content from learning its outcome: H(A) = I(a) Pr[a] + I(-a) Pr[-a]. Intuitively, it measures our uncertainty about the outcome of A; it is maximized (at 1 bit) when a and -a are equally likely, and approaches 0 when either a or -a approaches certainty. Ultimately, H(B) is the parameter you’re trying to reduce in this problem.
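A minimal sketch of this binary entropy (again in Python, with an arbitrary probability p for outcome a):

```python
import math

def entropy(p):
    """Entropy H(A), in bits, of a test whose outcome a has probability p."""
    if p in (0.0, 1.0):
        return 0.0  # no uncertainty when the outcome is certain
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(entropy(0.5))   # 1.0 bit, the maximum
print(entropy(0.99))  # close to 0
```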
We can easily condition on an outcome: H(B|a) is given by replacing all probabilities with conditional ones. It is our (remaining) uncertainty about B if we learn that a was the outcome of test A.
The conditional entropy H(B|A) is the expected value H(B|a) Pr[a] + H(B|-a) Pr[-a]. In other words, this is the expected uncertainty remaining about B after performing test A.
Finally, the mutual information I(A;B) = H(B) - H(B|A) measures the reduction in uncertainty about B from performing test A. As a result, it is a measure of the value of test A for learning about B. Irrelevantly but cutely, it is symmetric: I(A;B) = I(B;A).
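Putting the last few definitions together, here is a minimal sketch that computes H(B), H(B|A) and I(A;B) from a joint probability table over the four outcome pairs; the four numbers in the table are made up purely for illustration:

```python
import math

# Made-up joint distribution Pr[A outcome, B outcome]; purely illustrative numbers.
joint = {('a', 'b'): 0.40, ('a', '-b'): 0.10,
         ('-a', 'b'): 0.15, ('-a', '-b'): 0.35}

def entropy(dist):
    """Entropy, in bits, of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Marginal distributions of A and B
pr_a = {x: sum(p for (xa, _), p in joint.items() if xa == x) for x in ('a', '-a')}
pr_b = {y: sum(p for (_, yb), p in joint.items() if yb == y) for y in ('b', '-b')}

# Conditional entropy H(B|A) = Pr[a] H(B|a) + Pr[-a] H(B|-a)
h_b_given_a = 0.0
for x, px in pr_a.items():
    cond = {y: joint[(x, y)] / px for y in ('b', '-b')}
    h_b_given_a += px * entropy(cond)

h_b = entropy(pr_b)
print(h_b, h_b_given_a, h_b - h_b_given_a)  # H(B), H(B|A), I(A;B)
```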
...so if I perform tests A and B simultaneously 25 times, estimate Pr[a], Pr[-a], Pr[b] and Pr[-b] from those 25 trials and calculate I(A;B), and THEN I look at the result for A26, I should be able to predict B26, right? And if I(A;B) > I(C;B) > I(D;B), then I take test A as the most useful predictor? But if the set from which the sample was taken is large, and probably heterogeneous, and there might be other factors I haven’t included in my analysis, then test A might mislead me about the outcome of B. (Which will be Bayesian evidence, if it happens.) How many iterations should I run? Is there a rule of thumb? Thank you for such helpful answers.
So there are two potential sources of error in estimating I(A;B) from sample data:
The sample I(A;B) is a biased estimator of the true value of I(A;B), and will see slight patterns when there are none. (See this blog post, for example, for more information.)
Plus, of course, the sample will deviate slightly even from its expected value, so some tests will get “luckier” values than others.
Experimentally (I did a simulation), both of these have an effect on the order of 1/N, where N is the number of trials. So if you are comparing a relatively small number of tests, you should run enough iterations that 1/N is insignificant relative to whatever values of mutual information you end up obtaining. (These will be between 0 and 1 bit, but may vary depending on how good your tests are.)
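Here is a sketch of the kind of simulation I mean (a toy version, not the exact one I ran): draw N joint samples of two independent fair coins, for which the true I(A;B) is 0, so everything the plug-in estimator reports is error, and watch the average estimate shrink roughly like 1/N:

```python
import math
import random

def sample_mi(n, rng):
    """Plug-in mutual information estimate, in bits, from n joint samples of two independent fair coins."""
    counts = {}
    for _ in range(n):
        pair = (rng.random() < 0.5, rng.random() < 0.5)
        counts[pair] = counts.get(pair, 0) + 1
    mi = 0.0
    for (x, y), c in counts.items():
        pxy = c / n
        px = sum(v for (a, _), v in counts.items() if a == x) / n
        py = sum(v for (_, b), v in counts.items() if b == y) / n
        mi += pxy * math.log2(pxy / (px * py))
    return mi

rng = random.Random(0)
for n in (25, 100, 400):
    avg = sum(sample_mi(n, rng) for _ in range(2000)) / 2000
    print(n, avg, avg * n)  # avg * n stays roughly constant, so the error scales like 1/N
```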
If you have a large number of tests to compare, you run into a third issue:
Although for the typical test, the error is on the order of 1/N, the error for the most misestimated test may be much larger; if that error exceeds the typical value of mutual information, the tests ranked most useful will merely be the ones most misestimated.
Not knowing how errors in mutual information estimates tend to be distributed, I would reason from Chebyshev’s inequality, which makes no assumptions about this. It suggests that the error should be multiplied by sqrt(T), where T is the number of tests, giving us an error on the order of sqrt(T)/N. So make N large enough that this is small.
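To put made-up numbers on that: with T = 8 tests, sqrt(T) ≈ 2.8, so the worst-case error is roughly 2.8/N; if the mutual-information values you care about are around 0.1 bits, then N of a few hundred keeps that error an order of magnitude below the signal.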
Independently of the above, I suggest making up a toy model of your problem, in which you know the true value of all the tests and can run a simulation with a number of iterations that would be prohibitive in the real world. This will give you an idea of what to expect.
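For example (a toy model with completely made-up numbers, just to show the shape of it): eight hypothetical binary tests, each of which agrees with B with a known probability, so the true ranking is known and you can check how often N trials recover it:

```python
import math
import random

def estimate_mi(pairs):
    """Plug-in mutual information, in bits, from a list of (test outcome, B outcome) pairs."""
    n = len(pairs)
    mi = 0.0
    for x, y in set(pairs):
        pxy = sum(1 for p in pairs if p == (x, y)) / n
        px = sum(1 for p in pairs if p[0] == x) / n
        py = sum(1 for p in pairs if p[1] == y) / n
        mi += pxy * math.log2(pxy / (px * py))
    return mi

rng = random.Random(1)
accuracies = [0.9, 0.8, 0.7, 0.65, 0.6, 0.57, 0.55, 0.52]  # made-up "true" strengths of 8 tests
N = 50                                                      # trials; vary this and re-run
b_outcomes = [rng.random() < 0.5 for _ in range(N)]         # the B outcomes for the N trials
estimates = []
for acc in accuracies:
    # each hypothetical test agrees with B with probability acc, independently on each trial
    pairs = [(b if rng.random() < acc else not b, b) for b in b_outcomes]
    estimates.append(estimate_mi(pairs))
print(sorted(range(len(accuracies)), key=lambda i: -estimates[i]))  # ideally 0, 1, 2, ..., 7
```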
Oh, thank you. This was immensely useful. I will now pick some other object of study and limit myself to a few tests (about 8). I kinda suspected I’d have to obtain data for as many populations as possible, to estimate between-population variation, and for as many trial specimens as possible, but I didn’t know exactly how to check this for efficiency. Happy winter holidays to you!