I think the first order of business is to straighten out the notation, and what is known.
A — measurement from algorithm A on object O
B — measurement from algorithm B on object O
P(Q|I) — the probability you assign to Q based on some unspecified information I
Use these to assign P(Q | A,B,O,I).
You have 2 independent measurements of object O,
I think that’s a very bad word to use here. A and B are not independent; they’re different. The trick is coming up with their joint distribution, so that you can evaluate P(Q | A,B,O,I).
The correlation between the opinions of the experts is unknown, but probably small.
If the correlation is small, your detectors suck. I doubt that’s really what’s happening. The usual situation is that both detectors actually have some correlation to Q, and thereby have some correlation to each other.
We need to identify some assumptions about the accuracy of A and B, and their joint distribution. A and B aren’t just numbers; they’re probability estimates, constructed so that they would be correlated with Q. How do we express P(Q,A,B|O)? What information do we start with in this regard?
For a normal problem, you have some data {O_i} on which you can evaluate your detector A against Q and get the expectation of Q given A. Same for B.
The maximum entropy solution would proceed assuming that these statistics were the only information you had—or that you no longer had the data, but only had some subset of expectations evaluated in this fashion. I think Jaynes found the maximum entropy solution for two measurements which correlate to the same signal. I don’t think he did it in a mixture of experts context, although the solution should be about the same.
If instead you have all the data, the problem is equally straightforward. Evaluate the expectation of Q given A and B across your data set, and apply it to new data. Done. Yes, there’s a regularization issue, but it’s a 2-d → 1-d supervised classification problem. If you’re training A and B as well, do that in combination with this 2-d → 1-d problem as a stacked generalization problem, to avoid overfitting.
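A minimal sketch of that 2-d → 1-d combination step (assuming scikit-learn is available; the logistic-regression combiner and the toy data for A, B, and Q are illustrative choices, not anything from the original comment). The out-of-fold predictions are the stacked-generalization precaution: the combiner is only ever evaluated on outputs it was not fit on.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Hypothetical labeled data: for each object O_i we have the two expert
# probability estimates a_i, b_i and the true label q_i.
rng = np.random.default_rng(0)
q = rng.integers(0, 2, size=1000)                    # true Q
a = np.clip(0.7 * q + 0.3 * rng.random(1000), 0, 1)  # expert A's estimate (toy)
b = np.clip(0.6 * q + 0.4 * rng.random(1000), 0, 1)  # expert B's estimate (toy)

X = np.column_stack([a, b])      # 2-d input: the two expert opinions
combiner = LogisticRegression()  # any regularized 2-d -> 1-d classifier will do

# Out-of-fold predictions approximate how the combiner behaves on unseen data,
# which is the stacked-generalization precaution against overfitting.
oof = cross_val_predict(combiner, X, q, cv=5, method="predict_proba")[:, 1]

combiner.fit(X, q)  # final fit on all data
p_new = combiner.predict_proba(np.array([[0.8, 0.9]]))[:, 1]  # combine a new pair of opinions
print(oof[:5], p_new)
```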
The issue is exactly what data you are working from. Can you evaluate A and B across all data, or do you just have statistics (or assumptions expressed as statistics) on A and B across the data?
If the correlation is small, your detectors suck. I doubt that’s really what’s happening. The usual situation is that both detectors actually have some correlation to Q, and thereby have some correlation to each other.
The way I interpreted the claim of independence is that the verdicts of the experts are not correlated once you conditionalize on Q. If that is the case, then DanielLC’s procedure gives the correct answer.
To see this more explicitly, suppose that expert A’s verdict is based on evidence Ea and expert B’s verdict is based on evidence Eb. The independence assumption is that P(Ea & Eb|Q) = P(Ea|Q) * P(Eb|Q).
Since we know the posteriors P(Q|Ea) and P(Q|Eb), and we know the prior of Q, we can calculate the likelihood ratios for Ea and Eb. The independence assumption allows us to multiply these likelihood ratios together to obtain a likelihood ratio for the combined evidence Ea & Eb. We then multiply this likelihood ratio with the prior odds to obtain the correct posterior odds.
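As a toy worked example of that procedure (the numbers are made up for illustration): with a prior P(Q) = 0.25, expert A reporting P(Q|Ea) = 0.8, and expert B reporting P(Q|Eb) = 0.6, each posterior is converted to a likelihood ratio against the prior odds, the ratios are multiplied, and the result is converted back to a probability:

```python
def combine_posteriors(prior, posteriors):
    """Combine expert posteriors for Q, assuming the experts' evidence
    is conditionally independent given Q (the procedure described above)."""
    prior_odds = prior / (1 - prior)
    combined_odds = prior_odds
    for p in posteriors:
        # Each expert's likelihood ratio is their posterior odds over the prior odds.
        combined_odds *= (p / (1 - p)) / prior_odds
    return combined_odds / (1 + combined_odds)

# Hypothetical numbers: prior 0.25, expert A says 0.8, expert B says 0.6.
print(combine_posteriors(0.25, [0.8, 0.6]))  # ~0.947
```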
To see this more explicitly, suppose that expert A’s verdict is based on evidence Ea and expert B’s verdict is based on evidence Eb. The independence assumption is that P(Ea & Eb|Q) = P(Ea|Q) * P(Eb|Q).
You can write that, and it’s likely possible in some cases, but step back and think: does this really make sense to say in the general case?
I just don’t think so. The whole problem with mixture of experts, or combining multiple data sources, is that the marginals are not in general independent.
Sure, it’s not generically true, but PhilGoetz is thinking about a specific application in which he claims that it is justified to regard the expert estimates as independent (conditional on Q, of course). I don’t know enough about the relevant domain to assess his claim, but I’m willing to take him at his word.
I was just responding to your claim that the detectors must suck if the correlation is small. That would be true if the unconditional correlation were small, but it’s not true if the correlation is small conditional on Q.
The usual situation is that both detectors actually have some correlation to Q, and thereby have some correlation to each other.
This need not be the case. Consider a random variable Z that is the sum of two independent random variables X and Y. Expert A knows X, and is thus correlated with Z. Expert B knows Y, and is thus correlated with Z. Experts A and B can still be uncorrelated. In fact, you can make X and Y slightly anticorrelated and still have them both be positively correlated with Z.
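A small numerical check of that construction (toy distributions of my own choosing): draw independent X and Y, set Z = X + Y, and compare the three pairwise correlations:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)  # expert A's information
y = rng.normal(size=100_000)  # expert B's information, independent of X
z = x + y                     # the quantity both experts are predicting

corr = np.corrcoef([x, y, z])
print(corr[0, 2], corr[1, 2])  # each ~0.71: both experts correlate with Z
print(corr[0, 1])              # ~0: the experts are uncorrelated with each other
```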
Just consider the limiting case—both are perfect predictors of Q, with value 1 for Q, and value 0 for not Q. And therefore, perfectly correlated.
Consider small deviations from those perfect predictors. The correlation would still be large. Sometimes more, sometimes less, depending on the details of both predictors. Sometimes they will be more correlated with each other than with Q, sometimes more correlated with Q than with each other. The degree of correlation of A and B with Q will impose limits on the degree of correlation between A and B.
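A quick simulation of that near-limiting case (the 5% error rates are arbitrary, chosen only for illustration): each detector copies Q but independently flips a small fraction of its answers, and both end up strongly correlated with Q and, as a consequence, with each other:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
q = rng.integers(0, 2, size=n)  # the true signal Q
flip_a = rng.random(n) < 0.05   # detector A errs 5% of the time
flip_b = rng.random(n) < 0.05   # detector B errs 5% of the time, independently
a = np.where(flip_a, 1 - q, q)
b = np.where(flip_b, 1 - q, q)

corr = np.corrcoef([q, a, b])
print(corr[0, 1], corr[0, 2])  # each ~0.90: both detectors track Q closely
print(corr[1, 2])              # ~0.81: so they are strongly correlated with each other
```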
And of course, correlation isn’t really the issue here anyway; it’s much more like mutual information, with the same sort of triangle-inequality limits on the mutual information.
If someone is feeling energetic and really wants to work this out, I’d recommend looking into triangle inequalities for mutual information measures, and the previously mentioned work by Jaynes on the maximum entropy estimate of a variable from its known correlation with two other variables, and how that constrains the maximum entropy estimate of the correlation between the other two.