If you don’t get any information after the fact on whether O was Q or not, there’s not one right way to do it. JRMayne’s recommendation of averaging the expert judgments works, as does DanielLC’s recommendation of assuming that the experts are entirely uncorrelated. The trouble with assuming they’re uncorrelated is that it can give you pretty extreme probability estimates- but if you’re just making decisions based on some middling threshold (“call it a Q if P(Q)>.5”) then you don’t have to worry about extreme probability estimates! If you make decisions based on an extreme threshold (“call it a Q if P(Q)>.99”), then you do have to worry. One of the things that might be helpful is plotting what these formulas give you over (A, B) space, and seeing if that graph looks like what you / experts in this domain would expect.
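To make the two recipes concrete, here’s a minimal Python sketch (the function names are mine; the “uncorrelated” rule here is the standard trick of multiplying each expert’s likelihood ratio against the shared prior, which I’m assuming is what DanielLC had in mind):

```python
def average_experts(a, b):
    """JRMayne-style aggregation: a plain unweighted average of the two estimates."""
    return (a + b) / 2


def independent_experts(a, b, prior):
    """DanielLC-style aggregation, assuming the experts are uncorrelated.

    Each estimate is converted into a likelihood ratio against the shared prior,
    the ratios are multiplied, and the result is turned back into a probability.
    This multiplication is what pushes the combined estimate toward the extremes.
    """
    def odds(p):
        return p / (1 - p)

    posterior_odds = odds(prior) * (odds(a) / odds(prior)) * (odds(b) / odds(prior))
    return posterior_odds / (1 + posterior_odds)


# Evaluating either rule over a grid of (a, b) values gives the "A, B space"
# picture mentioned above, e.g.:
# grid = [[independent_experts(a / 10, b / 10, 0.5) for b in range(1, 10)]
#         for a in range(1, 10)]
```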
If you do get information after the fact, you’ll want to use what’s called a Bayesian Judge. Basically, it learns P(Q(O)|A,B,P(Q)) through Bayesian updates; you’re building an expert that says “if I consider all of the n times A said a and B said b, nP times it turned out to be Q, so P(Q)=P.”
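As a minimal sketch of what such a judge could look like (the binning into tenths and the strength of the smoothing toward the prior are choices I’m making up for illustration, not part of the method itself):

```python
from collections import defaultdict


class BayesianJudge:
    """Learns P(Q | A, B) by counting outcomes for binned pairs of expert reports."""

    def __init__(self, prior, prior_strength=2.0, bins=10):
        self.prior = prior                    # P(Q) before hearing from the experts
        self.prior_strength = prior_strength  # how many "virtual" observations the prior is worth
        self.bins = bins
        self.counts = defaultdict(lambda: [0, 0])  # (a_bin, b_bin) -> [times seen, times it was Q]

    def _cell(self, a, b):
        return (round(a * self.bins), round(b * self.bins))

    def update(self, a, b, was_q):
        """Feed the judge one resolved case: A said a, B said b, and it was (or wasn't) Q."""
        cell = self.counts[self._cell(a, b)]
        cell[0] += 1
        cell[1] += int(was_q)

    def predict(self, a, b):
        """Report P(Q) for a new case, smoothed toward the prior when data is thin."""
        n, k = self.counts[self._cell(a, b)]
        return (k + self.prior_strength * self.prior) / (n + self.prior_strength)
```

Because it counts actual outcomes, a judge like this also does the recalibration described next (cells where an expert said .9 but Q only happened 70% of the time get reported as .7), and with no resolved cases it can do nothing but return the prior, which is the starving problem below.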
The other neat thing about Bayesian judges is that they fix calibration problems with experts- it will quickly learn that when they say .9, they actually mean .7.
The trouble with the Bayesian judge is that it will starve if you can’t feed it data on whether or not O was Q. I won’t type up the necessary math unless this fits your situation, but if it does I’d be happy to.
The trouble with assuming they’re uncorrelated is that it can give you pretty extreme probability estimates
No. The trouble with assuming they’re uncorrelated is that they probably aren’t. If they were, the extreme probability estimates would be warranted.
I suppose, more accurately, the problem is that if there is a significant correlation, assuming they’re uncorrelated will give an equally significant error, and they’re usually significantly correlated.
No. The trouble with assuming they’re uncorrelated is that they probably aren’t. If they were, the extreme probability estimates would be warranted.
This is what I meant by extreme- further than warranted.
The subtler point was that the penalty for being extreme, in a decision-making context, depends on your threshold. Suppose you just want to know whether or not your posterior should be higher than your prior. Then both experts giving estimates above the prior (A>P(Q) and B>P(Q)) means that you vote “higher,” regardless of your aggregation technique, and if the experts disagree, you go with the one that feels more strongly (if you have no data on which one is more credible).
Again, if the threshold is higher, but not significantly higher, it may be that both aggregation techniques give the same results. One of the benefits of graphing them is that it will make the regions where the techniques disagree obvious- if A says .9 and B says .4 (with a prior of .3), then what do the real-world experts think this means? Choosing between the methods should be done by focusing on the differences caused by that choice (though first-principles arguments about correlation can be useful too).
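Putting numbers on that example, under the same two rules sketched earlier (my arithmetic, so check it against your own intuitions):

```python
# A says .9, B says .4, the prior is .3 -- the case above
a, b, prior = 0.9, 0.4, 0.3

average = (a + b) / 2  # 0.65

def odds(p):
    return p / (1 - p)

posterior_odds = odds(a) * odds(b) / odds(prior)
independent = posterior_odds / (1 + posterior_odds)  # ~0.93

# Both rules agree the posterior should be above the .3 prior, so a
# "higher than the prior?" decision comes out the same either way; near a
# .9 threshold, though, the two rules give opposite answers.
print(average, independent)
```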