I think I’ve figured this out. I still can’t implement it, since I don’t write PHP, but I’m going to drop it here anyway, to get a sanity check and in case seeing it written out will be useful for anyone else. (Hint, hint.)
Edit: On further examination, I think my conclusion is wrong… or rather, the first set of numbers I got was right, and just not useful for this purpose.
I started with a set of sample data: A horoscope with 2 votes for ‘not useful’ and 3 votes each for ‘sort-of useful’ and ‘useful’, totaling 8 votes.
I then used the Wolfram Alpha Wilson score interval calculator to get a pair of numbers for each of the 5 vote types, at 95% confidence:
‘Harmful’ and ‘awesome’ have a lower bound of 0 and an upper bound of .324
‘Not useful’ has a lower bound of .071 and an upper bound of .590
‘Sort of useful’ and ‘useful’ have a lower bound of .137 and an upper bound of .694
I messed around with using those numbers directly to get weighted scores for the horoscopes, but they didn’t work very well that way, so I adjusted each set to add up to 100%. For example, on the lower bound set, ‘not useful’ has (0.071)/(0.071+0.137+0.137)=0.206 of the adjusted votes. Multiplying these adjusted numbers by the weighting that I gave in the original post (−15, −1, +1, +3, +10) gave what look to me like sane numbers: My sample data got an adjusted lower bound score of 1.382 and an adjusted upper bound score of 0.216. (The adjusted upper bound score is low because ‘harmful’ votes are weighted more strongly than ‘awesome’ votes, and the upper bounds for those are closer to the upper bounds of the other types of votes—in other words, the ‘upper bound’ score is more pessimistic because of how things are weighted.)
I could easily have done something wrong, there, but assuming not and assuming that it’s easy to code a Wilson score interval calculator, this should be simple enough to add to the code once someone gets to it. (Peer appears to have become busy with something else.) The decision on whether to use upper or lower bounds seems to depend on how we want newer horoscopes to act in comparison to older ones, and I think using the lower bound (optimistic) one makes sense—I think new horoscopes should ideally be given a few opportunities to prove themselves at a higher rate of visibility before settling into their accurate place in the hierarchy, rather than having to claw their way up from enforced obscurity.
Numbers:
- Actual votes: score 1.25
- Lower bound, 95% confidence: score (total weighted score divided by sum of Wilson numbers) 0.48475
- Upper bound, 95% confidence: score 0.2155
- Adjusted lower bound, 95% confidence (keep the Wilson numbers proportional to each other but make ’em total 1.0): score 1.382
- Adjusted upper bound, 95% confidence: score 0.216
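To sanity-check the arithmetic above, here is a minimal Python sketch (the site’s script is PHP, but nothing here is PHP-specific) that computes the Wilson bounds directly rather than going through Wolfram Alpha. The weights and vote counts are the ones from the sample data; the function names are my own.

```python
import math

# Weights from the original post, in order:
# harmful, not useful, sort-of useful, useful, awesome
WEIGHTS = [-15, -1, 1, 3, 10]

def wilson_bounds(k, n, z=1.96):
    """Wilson score interval for k successes out of n trials (z=1.96 ~ 95%)."""
    if n == 0:
        return (0.0, 1.0)
    phat = k / n
    denom = 1 + z * z / n
    center = phat + z * z / (2 * n)
    margin = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n))
    return ((center - margin) / denom, (center + margin) / denom)

def adjusted_score(votes, bound):
    """Adjusted score from one set of Wilson bounds.

    votes: counts per category; bound: 0 for lower, 1 for upper.
    Renormalizing the bounds to total 1.0 and then weighting is the same
    as dividing the weighted total by the sum of the Wilson numbers.
    """
    n = sum(votes)
    bounds = [wilson_bounds(k, n)[bound] for k in votes]
    total = sum(bounds)
    return sum(b / total * w for b, w in zip(bounds, WEIGHTS))

votes = [0, 2, 3, 3, 0]  # the sample horoscope: 8 votes
print(adjusted_score(votes, 0))  # adjusted lower-bound score, ~1.38
print(adjusted_score(votes, 1))  # adjusted upper-bound score, ~0.215
```

The small differences from the figures above (1.382, 0.216) come from the rounding of the Wolfram Alpha bounds to three decimal places.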
I just realized that I didn’t mention this in a direct reply to you, so I should mention now that this method was not actually Bayesian. It seems to work well enough for whatever sites use it, but if you want, I could do a Bayesian analysis of the data similar to the one you did.
I don’t really care whether a given method is Bayesian or frequentist, so long as it works in context. But it looks to me like the point of any predictive method is to give a higher score to something with more votes, and I don’t think that makes sense here—there’s a risk of ending up in a situation where a few dozen horoscopes (based on how often the horoscopes are allowed to repeat; out of hopefully at least a few hundred) have many more votes than the others, because they happened to be at the top of the heap at one point and started getting picked more often by the RNG, which got them more votes, which widened the gap in how often they were chosen, which got them even more votes....
To ensure churn, fix a lower bound on the probability that a horoscope will be picked, at least until it has been picked enough times to accurately rank it against horoscopes that have been picked more often.
That would work, but it’s not obvious to me how to implement it.
Thinking aloud:
The probability of a given horoscope being picked is currently based on the percentage of the sum of the active horoscopes’ scores represented by its score. In other words, if we have 5 active horoscopes, and two have scores of 1 and three have scores of 1.5, the sum is 6.5, the two lower-scored horoscopes each have a 1/6.5=15.4% chance of being picked, and the three higher-scored horoscopes each have a 1.5/6.5=23.1% chance of being picked. But the script never actually calculates those percentages—it sums the scores, picks a random number in that range, and considers the horoscopes in ID order until the sum of the scores of the considered horoscopes is greater than the random number, at which point it uses the last considered horoscope. (It helps to think in terms of a line, with the horoscopes occupying segments of it whose lengths depend on their scores, and the random number being a point along the line.)
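That line-segment picture can be sketched directly. `pick_horoscope` is a hypothetical name, and the optional `r` parameter just exposes the random point so the behavior can be checked deterministically:

```python
import random

def pick_horoscope(scores, r=None):
    """Score-weighted selection: treat the scores as segment lengths on a
    line, drop a point on the line at random, and return the index of the
    segment the point lands in."""
    total = sum(scores)
    if r is None:
        r = random.uniform(0, total)
    cumulative = 0.0
    for i, score in enumerate(scores):
        cumulative += score
        if cumulative > r:
            return i
    return len(scores) - 1  # guard against r landing exactly on the total

scores = [1, 1, 1.5, 1.5, 1.5]  # the 5-horoscope example above
print(pick_horoscope(scores, r=2.3))  # lands in the third segment: index 2
```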
It could calculate the percentages, but then I’d have to figure out how to have it take weight away from high-percentage horoscopes in a sane way to give it to low-percentage horoscopes, and that sounds hard given just that information.
It could say that a given horoscope’s functional score is the greater of its actual score and some function of the number of votes it has, or the average of those two values. Even assuming that I can come up with a sane function, though, neither of those seems to work too well—using the greater of the two interferes with having negatively-scored horoscopes, and an average seems like it’ll add unwanted noise to the scores of highly-voted horoscopes, unless the function takes the score and/or votes into account in such a way as to approach the value of the score as more votes come in.
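One family of functions with that approach-the-score-as-votes-come-in property is a pseudo-vote blend; `prior` and `pseudo_votes` here are invented knobs for illustration, not anything from this thread:

```python
def blended_score(score, votes, prior=1.0, pseudo_votes=5):
    """Blend the observed score with a prior, weighted by vote count.

    With no votes the result is the prior; as votes accumulate, the
    pseudo-votes are swamped and the result approaches the real score.
    """
    return (votes * score + pseudo_votes * prior) / (votes + pseudo_votes)

print(blended_score(2.0, 0))     # no votes: pure prior, 1.0
print(blended_score(2.0, 1000))  # many votes: very close to 2.0
```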
Perhaps the Wilson scores can use different confidence levels for different horoscopes, with the confidence level being a function of the number of votes?
Hmm… here’s the scheme I’d use. I’d partition the horoscopes into two classes based on the widths of the Wilson confidence intervals at some specific confidence level. Horoscopes with intervals wider than some threshold are classed as poorly-characterized; otherwise, well-characterized. To select a horoscope, first select a class with probability proportional to the size of the class, e.g., if 20% of horoscopes are currently well-characterized, select the well-characterized horoscopes with probability 20%. If the chosen class is the well-characterized horoscopes, use score-weighted random selection within that class; otherwise, select uniformly at random among the poorly-characterized horoscopes.
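A rough Python sketch of that two-class scheme, assuming each horoscope carries a hypothetical precomputed `interval_width` field and a positive `score` (negative scores would need handling before using them as sampling weights):

```python
import random

def select_with_classes(horoscopes, threshold, rng=random):
    """Partition by Wilson interval width, pick a class proportionally
    to its size, then pick within the class.

    horoscopes: list of dicts with 'score' and 'interval_width' keys
    (hypothetical field names); threshold: maximum width for a horoscope
    to count as well-characterized.
    """
    well = [h for h in horoscopes if h["interval_width"] <= threshold]
    poor = [h for h in horoscopes if h["interval_width"] > threshold]
    # Pick the well-characterized class with probability equal to its share.
    if well and rng.random() < len(well) / len(horoscopes):
        # Score-weighted random selection within the well-characterized class.
        return rng.choices(well, weights=[h["score"] for h in well])[0]
    if poor:
        # Uniform selection among the poorly-characterized class.
        return rng.choice(poor)
    return rng.choices(well, weights=[h["score"] for h in well])[0]
```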
Or, more simply, use weighted scores for everything but count the horoscopes with the fewest votes twice?
Maybe consider a horoscope ‘new’ if it has less than 10% as many votes as the most-voted-on horoscope, double the scores of new horoscopes, and calculate normally from there? I have a sneaking suspicion that that will lead to a situation where some horoscopes have enough votes to not be ‘new’ anymore but too low a score to compete with the others, and wind up stuck—but it may be self-correcting if we check whether each horoscope is ‘new’ every day, so that when the highest-scoring one comes up to earn votes, any horoscopes that are stuck in the doldrums get a temporary boost. It may also help to apply the ‘new’ status to horoscopes that have less than 10% as many votes as the 30th most popular horoscope, 30 being a relevant number because it’s the number of days that a horoscope is considered ‘recently used’ if there are more than 60 horoscopes, and thus the fastest one can see a repeat.
Or, maybe have things with <10% as many votes as the relevant high-scorer get their score doubled, and things with 10-20% as many votes as the relevant high-scorer get a smaller bonus? That sounds like it will work… though, after testing, I’m not actually sure it’s necessary. This may be an artifact of the dummy data I used, though—I’ll poke at it more in a bit.
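A sketch of that tiered bonus. Only the 2× doubling comes from the discussion; the 1.5× middle tier is an invented placeholder, and the 10% boundary is treated as inclusive so that a horoscope with exactly 10% of the top vote count (like Test1 below) still gets doubled:

```python
def boosted_score(score, votes, max_votes):
    """Apply the 'new horoscope' bonus based on vote count relative to
    the most-voted-on horoscope."""
    if max_votes == 0:
        return score
    ratio = votes / max_votes
    if ratio <= 0.10:
        return score * 2    # 'new': double the score
    if ratio <= 0.20:
        return score * 1.5  # hypothetical smaller bonus for the 10-20% tier
    return score

print(boosted_score(0.24, 10, 100))  # Test1 case: 0.48
```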
Sanity check data:
HighVote=100
HighHarmful=3
HighUseless=17
HighSOUseful=35
HighUseful=40
HighAwesome=5
Lower-bound adjusted numbers, 95% confidence:
AdjHarmful=0.010 × −15 = −0.15
AdjUseless=0.109 × −1 = −0.109
AdjSOUseful=0.264 × 1 = 0.264
AdjUseful=0.309 × 3 = 0.927
AdjAwesome=0.022 × 10 = 0.22
Total adj. votes=0.714
Total points=1.152
HighScore=1.152/0.714=1.613
Test1Votes=10
Test1Score=0.24 (arbitrary, based on old test data given roughly-similar voting profile)
Test1AdjustedScore=0.48 (about 30% the chance of High—not bad)
Test2Votes=15
T2Harmful=1, adjusted=0.012, weighted=-0.18
T2Useless=2, adjusted=0.037, weighted=-0.037
T2SOUseful=5, adjusted=0.152, weighted=0.152
T2Useful=6, adjusted=0.198, weighted=0.594
T2Awesome=1, adjusted=0.012, weighted=0.12
Vote total: 0.411
Weighted total: 0.649
Base score: 1.579 (probably doesn’t need adjusting, actually)
The reason I suggest looking at the width of the Wilson confidence interval instead of directly at the number of votes is that the width of the confidence interval is a direct measure of the information we have about a horoscope. It’s hard to reason about what is likely to happen in terms of raw vote counts; what we really care about is the precision with which horoscope quality is known. In particular, learning the quality of extreme horoscopes (either good or bad) takes fewer votes than learning about 50-percenters, a fact which will be reflected in the width of the confidence interval.
That does make sense. It doesn’t help that each horoscope has 5 intervals, though. Maybe look at the narrowest one for each horoscope?
That seems reasonable.
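The narrowest-interval idea above can be sketched as follows, computing the 95% Wilson interval width for each vote category and taking the minimum (function names are my own):

```python
import math

def wilson_width(k, n, z=1.96):
    """Width of the Wilson score interval for k of n votes (z=1.96 ~ 95%)."""
    if n == 0:
        return 1.0
    phat = k / n
    margin = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n))
    return 2 * margin / (1 + z * z / n)

def narrowest_width(votes):
    """Collapse the five per-category intervals to one number by taking
    the narrowest, per the suggestion above."""
    n = sum(votes)
    return min(wilson_width(k, n) for k in votes)

# The 8-vote sample horoscope: the 0-vote categories give the narrowest
# interval (0 to ~0.324), so the result is ~0.324.
print(narrowest_width([0, 2, 3, 3, 0]))
```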