Part of the intuition for using something more like average log odds rather than average probabilities is probably that averaging probabilities seems to ignore extreme probabilities.
Counter-example:
9 out of 10 people give a 1:100,000,000 probability estimate of winning the lottery by picking random numbers; the last person gives a 1:10 estimate. Averaging the probabilities gives an estimate of roughly 1:100, and you foolishly conclude these are great odds given how cheap lottery tickets are.
Yes, context matters. If you have background knowledge that the true probability is fairly well known but a few people are completely wrong, then you should certainly not just average probabilities. Something like a trimmed median would be far better in that case.
On the other hand, some other questions may be of the sort where experts give much higher odds than most people. Maybe something like “what is the probability that within 12 months you are infected with a virus that includes the following base sequence”, where the 9 people look at the length of the given base sequence, estimate average virus genome size, and give odds on the order of 1:100,000,000. The tenth looked up a viral genome database and found that it’s in all known variants of SARS-CoV-2, and estimated 1:10 odds.
If you don’t know anything about the context, then you can’t distinguish these scenarios just based on the numbers in them. You can’t even reasonably say that there’s some underlying distribution of types of contexts and you can do some sort of average over them.
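To make the numbers concrete, here is a rough sketch of what three pooling rules do with the same ten estimates. It assumes that "1:100,000,000" and "1:10" are read as the probabilities 1e-8 and 0.1; the helper names `logit` and `inv_logit` are mine, not from anything above.

```python
import math
from statistics import mean, median

def logit(p):
    """Convert a probability to log odds."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Convert log odds back to a probability."""
    return 1 / (1 + math.exp(-x))

# Assumption: nine people say 1e-8, one person says 0.1.
estimates = [1e-8] * 9 + [0.1]

# Arithmetic mean of probabilities: dominated by the single high estimate.
mean_prob = mean(estimates)

# Mean of log odds, mapped back to a probability: stays near the majority.
mean_log_odds = inv_logit(mean(logit(p) for p in estimates))

# Median: ignores the dissenting estimate entirely.
median_prob = median(estimates)

print(f"mean of probabilities: {mean_prob:.3g}")
print(f"mean of log odds:      {mean_log_odds:.3g}")
print(f"median:                {median_prob:.3g}")
```

The mean of probabilities comes out around 0.01, the mean of log odds around 5e-8, and the median at 1e-8. The inputs are identical in the lottery and the virus scenario, so none of these rules can tell which of the two you are in.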