How not to sort by a complicated frequentist formula
In How Not To Sort By Average Rating, Evan Miller gives two wrong ways to generate an aggregate rating from a collection of positive and negative votes, and one method he thinks is correct. But the “correct” method is complicated, poorly motivated, insufficiently parameterized, and founded on frequentist statistics. A much simpler model based on a prior beta distribution has a more solid theoretical foundation and would give more accurate results.
Evan mentions the sad reality that big organizations are using obviously naive methods. In contrast, more dynamic sites such as Reddit have adopted the model he suggested. But I fear that it would cause irreparable damage if the world settles on this solution.
Should anything be done about it? What can be done?
This is also somewhat meta in that LW also aggregates ratings, and I believe changing the model was once discussed (and maybe the beta model was suggested).
In the Bayesian model, as in Evan’s model, we assume for every item there is some true probability p of upvoting, representing its quality and the rating we wish to give. Every vote is a Bernoulli trial which gives information on p. The prior for p is the beta distribution with some parameters a and b. After observing n actual votes, of which k are positive, the parameters of the posterior distribution are a+k and b+(n-k), so the posterior mean of p is (a+k)/(a+b+n). Under the model, this is the estimate of the true quality that minimizes expected squared error, and it reproduces all the desired effects: as votes accumulate it converges to the proportion of positive ratings, while items with insufficient data are pulled towards the prior mean a/(a+b).
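The update above is short enough to write out directly. This is only a minimal sketch; the function name posterior_mean and the example prior a = b = 2 are my own illustrative choices, not values taken from any real system.

```python
def posterior_mean(k: int, n: int, a: float, b: float) -> float:
    """Posterior mean of p under a Beta(a, b) prior, after k upvotes in n votes.

    The posterior is Beta(a + k, b + n - k), whose mean is (a + k) / (a + b + n).
    """
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    return (a + k) / (a + b + n)


# Illustrative prior only (assumption): a = b = 2, i.e. prior mean 0.5.
one_vote = posterior_mean(k=1, n=1, a=2, b=2)         # 3/5
many_votes = posterior_mean(k=90, n=100, a=2, b=2)    # 92/104
```

With this prior, an item with a single upvote gets 0.6, while an item with 90 upvotes out of 100 gets about 0.885, so the well-attested item ranks higher even though its raw proportion is lower.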
The specific parameters a and b depend on the quality distribution in the specific system. The ratio a/(a+b) is the average quality and can be taken as simply the empirical proportion of positive votes among all votes in the system. The sum a+b is an inverse measure of variance: a high value means most items are close to average quality, and a low value means items are either extremely good or extremely bad. The sum is harder to calibrate than the ratio, but this can still be done using the overall data (e.g., MLE from the entire voting data).
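One way such a calibration could look is sketched here, under assumptions of my own: the mean a/(a+b) is fixed at the overall proportion of positive votes, and only the sum s = a+b is chosen, by a coarse grid search that maximizes the beta-binomial likelihood of the per-item counts. This is a simplification of a full joint MLE, and the names calibrate_prior, items and s_grid are hypothetical.

```python
import math


def log_beta(x: float, y: float) -> float:
    """Logarithm of the beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def beta_binomial_loglik(items, a: float, b: float) -> float:
    """Log-likelihood of (k, n) vote counts under a Beta(a, b) prior.

    The binomial coefficients are dropped because they do not depend on a or b.
    """
    return sum(log_beta(a + k, b + n - k) - log_beta(a, b) for k, n in items)


def calibrate_prior(items, s_grid=None):
    """Return (a, b): mean fixed at the global upvote share, a + b by grid search."""
    total_k = sum(k for k, _ in items)
    total_n = sum(n for _, n in items)
    if total_n == 0:
        raise ValueError("no votes to calibrate from")
    mean = total_k / total_n
    if not 0 < mean < 1:
        raise ValueError("need both positive and negative votes in the data")
    if s_grid is None:
        # Assumed search range for a + b: 0.01 to 10000 on a log scale.
        s_grid = [10 ** (e / 10) for e in range(-20, 41)]
    best_s = max(
        s_grid,
        key=lambda s: beta_binomial_loglik(items, mean * s, (1 - mean) * s),
    )
    return mean * best_s, (1 - mean) * best_s


# Toy data (made up): polarized items vs. uniformly average items.
polarized = [(10, 10), (0, 10), (10, 10), (0, 10)]
uniform = [(5, 10), (5, 10), (5, 10), (5, 10)]
a_pol, b_pol = calibrate_prior(polarized)
a_uni, b_uni = calibrate_prior(uniform)
```

On the toy data, the polarized set yields a much smaller a+b than the uniform set, which matches the reading of a+b as an inverse measure of variance. A real system would probably want a proper optimizer and more care with items that have very few votes.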
For the specific problem of sorting, there are other considerations than mere quality. A comment can be in universal agreement, but not otherwise interesting or notable. Such comments may not deserve as prominent a mention as controversial comments which provoke stronger reactions. For this purpose, the “sorting rating” can be multiplied by some function of the total number of votes, such as the square root. If the identity function is used, the result behaves like a simple vote count: for large n the product approaches k, the number of positive votes, and if the rating is first centered at 1/2 it approaches half the difference between the number of positive and negative votes.
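A sketch of such a sorting rating, assuming the posterior mean (a+k)/(a+b+n) as the quality estimate and the square root as the default weighting function. The name sort_score, the weight parameter and the prior a = b = 2 are illustrative assumptions.

```python
import math
from typing import Callable


def sort_score(
    k: int,
    n: int,
    a: float,
    b: float,
    weight: Callable[[float], float] = math.sqrt,
) -> float:
    """Quality estimate times a function of the total number of votes.

    quality is the posterior mean under a Beta(a, b) prior; weight(n) boosts
    items that attracted more votes, whatever their direction.
    """
    quality = (a + k) / (a + b + n)
    return quality * weight(n)


# Illustrative prior only (assumption): a = b = 2.
quiet = sort_score(k=3, n=3, a=2, b=2)        # unanimous, but only 3 votes
busy = sort_score(k=60, n=100, a=2, b=2)      # contested, 100 votes
by_count = sort_score(k=60, n=100, a=2, b=2, weight=lambda votes: votes)
```

Here the unanimous three-vote comment scores about 1.24 and the contested hundred-vote comment about 5.96, so the second is listed first. Passing the identity function as weight gives n times the quality estimate.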