Help with a (potentially Bayesian) statistics / set theory problem?
Update: as it turns out, this is a voting system problem, which is a difficult but well-studied topic. Potential solutions include Ranked Pairs (complicated) and BestThing (simpler). Thanks to everyone for helping me think this through out loud, and for reminding me to kill flies with flyswatters instead of bazookas.
I’m working on a problem that I believe involves Bayes, I’m new to Bayes and a bit rusty on statistics, and I’m having a hard time figuring out where to start. (EDIT: it looks like set theory may also be involved.) Your help would be greatly appreciated.
Here’s the problem: assume a set of 7 different objects. Two of these objects are presented at random to a participant, who selects whichever of the two they prefer. (There is no “indifferent” option.) The order within a pair doesn’t matter, and repeated pairs are not allowed.
Basic combinatorics says there are 21 different possible combinations: 7! / (2! × (7−2)!) = 21.
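For what it’s worth, this is quick to sanity-check in Python (the object labels are just placeholders):

    import itertools
    import math

    objects = ["A", "B", "C", "D", "E", "F", "G"]  # 7 placeholder objects

    pairs = list(itertools.combinations(objects, 2))  # unordered, no repeats
    print(len(pairs))       # 21
    print(math.comb(7, 2))  # 21, the same count from 7! / (2! * (7-2)!)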
Now, assume the researcher wants to know which single option has the highest probability of being the “most preferred” to a new participant based on the responses of all previous participants. To complicate matters, each participant can leave at any time, without completing the entire set of 21 responses. Their responses should still factor into the final result, even if they only respond to a single combination.
At the beginning of the study, there are no priors. (CORRECTION via dlthomas: “There are necessarily priors… we start with no information about rankings… and so assume a 1:1 chance of either object being preferred.”) If a participant selects B from {A,B}, the probability of B being the “most preferred” object should go up, and A’s should go down, if I’m understanding correctly.
NOTE: Direct ranking of objects 1-7 (instead of pairwise comparison) isn’t ideal because it takes longer, which may encourage the participant to rationalize. The “pick-one-of-two” approach is designed to be fast, which is better for gut reactions when comparing simple objects like words, photos, etc.
The ideal output looks like this: “Based on ___ total responses, participants prefer Object A. Object A is preferred __% more than Object B (the second most preferred), and ___% more than Object C (the third most preferred).”
Questions:
1. Is Bayes actually the most straightforward way of calculating the “most preferred”? (If not, what is? I don’t want to be Maslow’s “man with a hammer” here.)
2. If so, can you please walk me through the beginning of how this calculation is done, assuming 10 participants?
Thanks in advance!
It’s not a Bayesian method, but you could use Randall Munroe’s algorithm, probably adding on the assumption of transitivity for each participant except where it is explicitly violated. The way that works is:
Step 1: you say that participant 1 has A>B if 1) they explicitly ranked A>B, or if 2) they have a transitive path giving A>B and do not have a transitive path giving B>A. Repeat for each pair of alternatives for each participant. (note that a participant may have A=B)
Step 2: you say that the group has A>B if more participants have A>B than have B>A. Repeat for each pair of alternatives. (note that the group may have A=B)
Step 3: you count up how many alternatives A is preferred to by the group, and divide by that number plus the number of alternatives that are preferred to A. This is A’s score. Repeat for each alternative. (A sketch of all three steps in code follows.)
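Here’s a minimal Python sketch of the three steps under the stated assumptions (the zero-denominator fallback is my own choice, not part of the algorithm as described):

    OBJECTS = ["A", "B", "C", "D", "E", "F", "G"]

    def transitive_closure(prefs):
        """prefs is a set of (x, y) pairs meaning the participant picked x over y.
        Returns the set closed under transitivity (if x>y and y>z, then x>z)."""
        closed = set(prefs)
        changed = True
        while changed:
            changed = False
            for (a, b) in list(closed):
                for (c, d) in list(closed):
                    if b == c and (a, d) not in closed:
                        closed.add((a, d))
                        changed = True
        return closed

    def group_scores(all_prefs):
        """all_prefs: one set of explicit (winner, loser) pairs per participant."""
        closures = [transitive_closure(p) for p in all_prefs]
        # Step 1: a participant has A>B only if the closure gives A>B but not B>A.
        indiv = [{(a, b) for (a, b) in c if (b, a) not in c} for c in closures]
        result = {}
        for x in OBJECTS:
            wins = losses = 0
            for y in OBJECTS:
                if x == y:
                    continue
                # Step 2: the group has x>y if more participants have x>y than y>x.
                for_x = sum((x, y) in p for p in indiv)
                for_y = sum((y, x) in p for p in indiv)
                if for_x > for_y:
                    wins += 1
                elif for_y > for_x:
                    losses += 1
            # Step 3: score = wins / (wins + losses); ties count in neither.
            # (Fallback to 0.0 when there's no data at all is arbitrary.)
            result[x] = wins / (wins + losses) if wins + losses else 0.0
        return result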
If you only care about which object is most likely to be #1, and not about any other ranks, then another method is to look, for each participant, only at the objects that are undefeated. Each object that a participant never rejected would get a score of 1/(# of undefeated objects), and the other objects would get a score of 0. Then you just sum the scores across participants. Or, you might not want to give the same score to an object that the participant never saw as to one that they preferred three times, so you could weight by the number of times selected; perhaps give the object a score of (1 + # of objects it was preferred to)/(sum of the numerators), where you divide by the sum to normalize each participant’s scores to add to one.
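And a sketch of this “undefeated objects” scoring, with the weighted variant as an option (my own reading of the formula; treat it as illustrative):

    def undefeated_scores(all_prefs, objects, weighted=False):
        """all_prefs: per participant, the set of (winner, loser) pairs they answered.
        Scores only 'who is most likely #1': objects a participant never rejected
        split that participant's single vote."""
        totals = {x: 0.0 for x in objects}
        for prefs in all_prefs:
            defeated = {loser for (_, loser) in prefs}
            undefeated = [x for x in objects if x not in defeated]
            if weighted:
                # score = (1 + # of times preferred) / (sum of the numerators)
                numer = {x: 1 + sum(1 for (w, _) in prefs if w == x)
                         for x in undefeated}
                total = sum(numer.values())
                for x in undefeated:
                    totals[x] += numer[x] / total
            else:
                # each undefeated object splits the vote evenly
                for x in undefeated:
                    totals[x] += 1 / len(undefeated)
        return totals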
Wow, that’s so simple it could possibly work!
Many thanks—this is most likely what I’ll go with. I appreciate your help. :-)
Even if every participant answered all 21 questions, and every participant answered with respect to some total ordering, this would still only reduce to the problem of preferential voting. That is a really hard problem, and I’m not aware of any voting system that’s provably “correct”; in fact, there are results showing that, for some definitions of “correct”, no correct voting system exists at all. There are various voting systems that satisfy some, but not all, of the reasonable properties you could name, and you should pick one.
“Ranked pairs” is one natural choice (I don’t know how well it performs in practice, though) because it deals in pairs to begin with. Basically, for each pair (A,B) you tally how many participants picked A over B. You go through the pairs in order of how significant a majority they represent. In this case, you wouldn’t want to use the usual method for deciding which majority is better, because you have to take into account the number of people who answered the question, too. There was a post on LW once which linked to an article about this (in the context of e.g. ratings of movies on IMDB, where a movie rated 10.0 by 5 users isn’t as good as a movie rated 9.0 by 100 users). Anyway, the next thing you do with the pairs is lock them in one by one, except when it would create a cycle together with pairs that have already been locked in. If you do this, the “locked in” pairs will show the most preferred object.
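A bare-bones sketch of the locking step, assuming raw margins as the measure of majority strength (which, per the above, you’d probably want to replace with something that accounts for how many people answered each pair):

    from itertools import combinations

    def ranked_pairs_winner(ballots, objects):
        """ballots: list of (winner, loser) pairs pooled across participants.
        Majorities are sorted by margin, then locked in unless they'd complete
        a cycle. A real implementation also needs a rule for tied majorities."""
        tally = {}
        for (w, l) in ballots:
            tally[(w, l)] = tally.get((w, l), 0) + 1
        majorities = []
        for (a, b) in combinations(objects, 2):
            ab, ba = tally.get((a, b), 0), tally.get((b, a), 0)
            if ab != ba:
                winner, loser = (a, b) if ab > ba else (b, a)
                majorities.append((abs(ab - ba), winner, loser))
        majorities.sort(reverse=True)  # strongest majority first

        locked = set()
        def reachable(src, dst):
            stack, seen = [src], set()
            while stack:
                node = stack.pop()
                if node == dst:
                    return True
                if node in seen:
                    continue
                seen.add(node)
                stack.extend(y for (x, y) in locked if x == node)
            return False

        for (_, w, l) in majorities:
            if not reachable(l, w):  # locking w>l must not create a cycle
                locked.add((w, l))
        # The winner(s): whatever nothing is locked in above.
        losers = {l for (_, l) in locked}
        return [x for x in objects if x not in losers]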
Upvoted. This problem is the well-studied voting problem.
In particular, the seminal result is Arrow’s impossibility theorem, which states that no ranked voting system can satisfy all of the following criteria (from Wikipedia): unrestricted domain, non-dictatorship, Pareto efficiency, and independence of irrelevant alternatives.
Thanks—the voting system analogy didn’t occur to me. Reading up on ranked pairs: http://en.wikipedia.org/wiki/Ranked_pairs
If we allow inconsistency, i.e. if you plan on using this in the real world, then you could get the response B>A, A>C, C>B. That is, there may not be any such thing as a most preferred object, and thus no such thing as the “probability of being the ‘most preferred’ object.”
A workaround would be to ask a related question: “if object 1 is A, and object 2 is unknown, what is the probability that A>2?” The most preferred object would just be the object with the highest answer to this question.
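A rough sketch of scoring that related question from pooled responses; the Beta(1,1)/Laplace smoothing, which defaults unseen pairs to 1/2, is my own assumption:

    def beat_random_scores(ballots, objects):
        """For each object, estimate P(it beats a uniformly random other object)
        from pooled pairwise responses."""
        wins = {}
        for (w, l) in ballots:
            wins[(w, l)] = wins.get((w, l), 0) + 1
        scores = {}
        for a in objects:
            ps = []
            for b in objects:
                if a == b:
                    continue
                ab, ba = wins.get((a, b), 0), wins.get((b, a), 0)
                ps.append((ab + 1) / (ab + ba + 2))  # smoothed win rate
            scores[a] = sum(ps) / len(ps)
        return scores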
The alternate path would be to assume that there is a preference ordering from 1 to 7, with some sort of noise applied. This feels clunky to me, though.
While clunky, this seems to be a perfectly workable approach for 7 objects. There are 5040 permutations. For each piece of evidence, add probability mass to those that correspond and remove it from those that differ (perhaps weighted by how much they correspond or differ?). The probability of your object being most preferred, then, is the sum of the probabilities of those permutations in which it occupies the highest point.
Okay, if A is preferred from { A , [B-G] }, that should add probability mass to [A, [B,...,G] ], where [A, [B,...,G] ] is a ranked set of objects where the first slot is most preferred. That would represent 720 (6!) sets out of 5040.
All other sets (7! − 6! = 4,320) should either stay the same probability or have probability mass removed.
Then, the probability of A being “most preferred” = the sum of the probability mass of all 720 sets that have A as the highest ranked member. Likewise for B through G. Highest total probability mass wins.
Am I understanding that correctly?
I don’t think I’m following you.
We see a new piece of evidence—one of the people prefers C to E
C will be preferred to E in half the lists. Those lists become more probable, the other half become less probable. How much more/less probable depends on how much error you expect to see and of what type.
Repeat on all the data.
You only actually look at the first member when asking the odds that a particular object is there—at which point, yes, you sum up the probability of those 720 sets.
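If it helps, here’s a minimal sketch of this update, assuming a simple noise model in which each response agrees with the true ordering with probability 1 − ε (the value of ε is a made-up parameter):

    from itertools import permutations

    OBJECTS = ["A", "B", "C", "D", "E", "F", "G"]

    def most_preferred_probs(responses, epsilon=0.2):
        """responses: list of (winner, loser) pairs, pooled across participants.
        Start with a uniform prior over all 5040 orderings; each response
        upweights the orderings it agrees with and downweights the rest."""
        perms = list(permutations(OBJECTS))  # index 0 = most preferred
        weights = [1.0] * len(perms)
        for (w, l) in responses:
            for i, perm in enumerate(perms):
                agrees = perm.index(w) < perm.index(l)
                weights[i] *= (1 - epsilon) if agrees else epsilon
        total = sum(weights)
        probs = {x: 0.0 for x in OBJECTS}
        for perm, wt in zip(perms, weights):
            probs[perm[0]] += wt / total  # mass on 'perm[0] is most preferred'
        return probs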
Ah, I see. Instead of updating half the lists, I was updating the 720 sets where C is the #1 preference. Thanks for the clarification.
I don’t know that “most preferred” naturally translates into “preferred to every other option.” Your “related question” seems a perfectly appropriate generalization of which situations with a single dominant object are a special case (with well ordered sets a special case of that).
Good point about inconsistency… I was thinking that individual responses may be inconsistent, but the aggregated responses of the group might reveal a significant preference.
My first crack at this was to use a simple voting system, where choosing B from {A,B} means +1 vote for B and 0 for A, and the largest tally across all participants wins. What messes that up is participants leaving without completing the entire set, which introduces selection bias, even if the pairs are served at random.
Preference ordering / ranking isn’t ideal because it takes longer, which may encourage the participant to rationalize. The “pick-one” approach is designed to be fast, which is better for gut reactions when comparing words, photos, etc.
If the aggregated preferences are transitive (i.e., ‘not inconsistent’ in your and Manfred’s wording), then this preference relation defines a total order on the objects, and there is a unique object that is preferred to every other object (in aggregate). (Furthermore, this is isomorphic to the set {1,2,3,...,7} under the ≤ relation.)
As I understand things, you have no guarantee of transitivity in the aggregated preferences even if you do have transitivity in the individual preferences. (The classic example: three participants ranking A>B>C, B>C>A, and C>A>B are each transitive, yet the majority prefers A to B, B to C, and C to A.)
Yes, of course you are correct.
Very helpful—reading about this now. Starting here: http://en.wikipedia.org/wiki/Ranking
I think perfect entropy assumes no correlation between different people’s preferences. As such, what a new participant prefers would always be impossible to predict.
We’d need some prior for how much correlation there is.
I could be wrong. I don’t know much about perfect entropy.
The posterior distribution is not assumed to be perfect entropy, of course.
If your prior is that there’s no correlation between Alice and Bob, your knowledge about Alice won’t change your knowledge about Bob. And if there’s no prior correlation between a correlation between Alice and Bob and one between Charlie and the average of Alice and Bob, then finding out that Alice and Bob are correlated won’t tell you that Charlie is also correlated.
Basically, I think the maximum entropy is that all of them gave their answers at random.
I don’t understand what Alice et al. are analogous to.
The described experimental setting has 21 ordered pairs of objects, each of them having a preference strength S defined as the proportion of people who prefer the first object to the second one.
Either we begin with a maxent prior, that is, p(S) uniform on the interval (0,1) for each pair, and each observation updates only the corresponding pair. Or, if we want to be fully general, we could work with a 21-dimensional distribution p(S1,S2,...,S21); begin again with a uniform maxent prior and update accordingly. Given the restricted set of observations (only pairwise preference expressions), both ways are equivalent: the distribution p(S1,...,S21) will always remain separable, equal to p1(S1)p2(S2)...p21(S21). In this sense there is no revealed correlation: knowing that a person prefers object O3 to O6 doesn’t tell us anything about the probability of her preferring O1 over O7. However, post-experiment there will still be a non-maxent posterior for the preference between O1 and O7. The sentences quoted above, as I have interpreted them (i.e. “with a maxent prior, we never learn anything about the preferences”), are therefore false.
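A sketch of the separable update described above, keeping an independent Beta(1,1) (uniform) posterior per pair:

    from itertools import combinations

    def pair_posteriors(responses, objects):
        """responses: list of (winner, loser) pairs. Each observation updates
        only its own pair, so the joint posterior stays separable."""
        # (alpha, beta) counts per unordered pair, starting from the uniform prior
        post = {frozenset(p): [1, 1] for p in combinations(objects, 2)}
        for (w, l) in responses:
            key = frozenset((w, l))
            first, _ = sorted((w, l))
            if w == first:
                post[key][0] += 1  # win for the alphabetically-first object
            else:
                post[key][1] += 1
        # posterior mean of S = P(alphabetically-first object preferred)
        return {tuple(sorted(k)): a / (a + b) for k, (a, b) in post.items()}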
We will know with certainty the preferences of the people asked. We will have no knowledge of the preferences of the people we didn’t ask.
That’s a fully general argument against statistical inference, isn’t it? Why bother making surveys? They inform us only about the opinions of the participants, giving us no knowledge about the rest of the population...
Because we don’t have a maximum entropy prior. We have reason to believe that there is a correlation between people’s opinions. We also have reason to believe that there’s a correlation between correlations. For example, if we survey a bunch of people and their opinions strongly correlate, we can infer that the rest of the population also correlates with them.
For me, a maximal entropy prior is a probability distribution over some reasonable set of hypotheses, such as H(n) = “n percent of people prefer A to B”. In such a case, the prior p(H(n)) is uniform over (0,100). If we know that, say, H(80) is true, we know that a randomly selected person is 80% likely to prefer A over B. A survey enables us to update the prior and eventually locate the correct hypothesis, whatever prior we are starting from. It doesn’t need to explicitly assume any correlation. That 80% of people share an opinion isn’t called a correlation between their opinions, in the usual sense of what “correlation” means.
You seem to have a somewhat different notion of a maximal entropy prior. Perhaps a maximal entropy distribution over all possible hypotheses? You seem to imply that with maximum entropy, induction is impossible, or something along these lines. I don’t think this is the standard meaning of “maximum entropy prior”.
As I stated at the beginning, I don’t know the standard meaning of maximum entropy prior.
This time when I looked it up, I found a simpler definition for finite cases. I’m not sure why I missed that before. I think I can figure out where the confusion is. I was thinking of every possible combination of individual opinions as a separate possibility. If that’s the case, having them all be independent of each other is the maximum entropy. If, on the other hand, you only look at the aggregate, and consider each H(n) (e.g. H(80)) to be one case, then maximum entropy would seem to be that H(n) is uniformly distributed.
I don’t think that’s quite right either. I suspect that has something to do with H(n) being continuous instead of discrete. I know the Jeffreys prior for that is beta(1/2,1/2), as opposed to beta(1,1), which is the uniform distribution.
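For concreteness, the posterior mean under a Beta(a0, b0) prior after k wins in n comparisons is (k + a0)/(n + a0 + b0), so the two priors give slightly different answers on the same (hypothetical) data:

    def posterior_mean(k, n, a0, b0):
        """Posterior mean of P(prefer A to B) under a Beta(a0, b0) prior,
        after observing k wins for A in n comparisons."""
        return (k + a0) / (n + a0 + b0)

    k, n = 3, 4  # hypothetical data: A chosen 3 times out of 4
    print(posterior_mean(k, n, 1.0, 1.0))  # uniform Beta(1,1) prior   -> 0.666...
    print(posterior_mean(k, n, 0.5, 0.5))  # Jeffreys Beta(1/2,1/2)    -> 0.7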
“There are no priors” is not correct—there are necessarily priors. I think what you mean is “we start with no information about rankings” and so would assume perfect entropy—a 1:1 chance of either member of any pair being preferred.
Right—thanks for the correction. Posted a correction in the main text.