Same technical solution I always offer: An upvote or downvote should add or subtract the number of bits of information conveyed by that vote, conditioned on the identity of the voter and the target.
In the simplest version, this would mean that if person X upvotes or downvotes everything written by person Y, those votes count for nothing. If X upvotes half of every comment by person Y, and never downvotes anything by Y, those votes count for nothing (if we assume X missed the comments he didn’t vote on), or up to 1 bit (if we assume X saw all the other comments).
Better would be to use a model that blended X’s voting pattern overall with X’s voting on Y’s posts and comments.
I’m not sure what the exact mathematical proposal here is, but I shall guess the following rule: if X has voted positively on Y p times out of n votes so far, then X’s next upvote confers a karma score of -log((p+1)/(n+2)), and X’s next downvote confers log((n-p+1)/(n+2)). X voting positively on each of Y’s n posts gives a total karma of log(n+1); voting negatively on everything gives -log(n+1). Logs are base 2. Votes never count for nothing, because X’s votes on Y so far are only a sample, from which we cannot conclude that X will vote with certainty either way.
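The guessed rule is easy to sketch in code. Here is a Python illustration of the formula above (the function name is mine, not part of the proposal):

```python
from math import log2

def vote_score(p, n, vote):
    """Karma conferred by X's next vote on Y, given that X has upvoted
    Y p times out of n prior votes. vote is +1 (up) or -1 (down)."""
    if vote == +1:
        return -log2((p + 1) / (n + 2))
    return log2((n - p + 1) / (n + 2))

# A first vote (p = n = 0) is worth the maximum possible, +/- 1 bit.
print(vote_score(0, 0, +1), vote_score(0, 0, -1))  # 1.0 -1.0

# n straight upvotes telescope to a total of log2(n + 1).
total, p = 0.0, 0
for _ in range(40):
    total += vote_score(p, p, +1)
    p += 1
print(total)  # log2(41), about 5.36
```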
This actually rates newbies’ votes above everyone else’s in importance: X’s first vote on Y is always worth the maximum possible, +/- 1.
The general principle of the proposal is that to the extent that you can predict an opinion, you are less incrementally informed by finding out what it is, which as a matter of information theory is true. How far might one take this? For example, it suggests ignoring anyone’s political views once one has identified them. SJWs and NRXs alike would be the first to be tuned out. If they want to be paid attention to they would have to find ways of saying new things, although (since they Have Views that determine all their views on individual things) this is likely to converge on finding new ways to say old things, i.e. writing clickbait. On the reader’s side, one should primarily read people one knows nothing about, at least until one has “solved” them and can predict all their further output well enough to get diminishing returns. Personal relationships likewise: they can’t last if they’re based on novelty. Once you have solved a potential partner, then you can decide whether you want to continue to spend time with them for what they are, rather than what they may be. This is the purpose of the rituals of dating and courtship.
I’m not expressing an opinion for or against this, just following the idea.
ETA: Some mathematical simulation shows that if half of X’s votes on Y are positive, the total karma resulting from those votes by the above rule depends sensitively on the order in which they are made. For example, 20 positive votes out of 40 can easily give a total karma of anywhere from about −11 to +11. If all upvotes precede all downvotes, the total is −33.6; if the reverse, +33.6. Also, after a long string of positive votes, a single negative vote cancels out most of the karma, and vice versa. The rule I proposed seems too sensitive to properties of the vote sequence that one might not want it to depend on.
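The order sensitivity is easy to reproduce. A sketch of the simulation (my own code, applying the same rule as above):

```python
from math import log2

def total_karma(votes):
    """Total karma from a sequence of +1/-1 votes under the guessed rule:
    Laplace-smoothed prediction from counts so far, base-2 logs."""
    p = n = 0
    total = 0.0
    for v in votes:
        if v == +1:
            total += -log2((p + 1) / (n + 2))
            p += 1
        else:
            total += log2((n - p + 1) / (n + 2))
        n += 1
    return total

# Same 20-up/20-down votes, opposite orders, wildly different totals.
print(total_karma([+1] * 20 + [-1] * 20))  # about -33.6
print(total_karma([-1] * 20 + [+1] * 20))  # about +33.6
```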
I didn’t think of that, but do you think karma shouldn’t depend on the order in which votes are made? Shouldn’t a person who gets 20 downvotes followed by 20 upvotes have higher karma at the end than a person who had 20 upvotes followed by 20 downvotes? The first indicates improvement; the second indicates getting less interesting over time.
I am confused by how you’re doing the computation, though. If half of X’s votes on Y are positive and half are negative, I would expect to compute X’s total contribution to Y as zero. I wouldn’t keep a running sum of X’s contribution to Y’s karma on each thing Y has said. We can also go back and recompute the contribution to previous comments as X makes more comments. But I’d probably rather have an adaptive algorithm so that the score on individual comments reflects the situation at the moment the rating was made.
Even if we did it that way, though, this sensitivity is not a real problem. Nearly every adaptive algorithm or learning algorithm has that kind of sensitivity. It never matters in practice when there’s enough data. Text compression algorithms don’t have drastically different compression ratios if you swap text input blocks around.
This doesn’t work for new posters.
There’s no good reason for the votes of new posters to count for much. If they don’t, there are fewer sockpuppet problems.
Why do you think that? When you have no prior, either assume P(up) = P(down), or (better) use the prior obtained by averaging all votes by all users. That’s standard practice.
So if someone pops up that everyone thinks posts utter dreck and votes accordingly, those votes would count for nothing?
What are the odds that every person on LessWrong will see and vote on every comment made by this one person? This is not a real scenario. If you’re worried about it, though, “a model that blended X’s voting pattern overall with X’s voting on Y’s posts and comments” will solve that problem.
(Note: my earlier comment was nonsense, based on a misreading of what Richard wrote.)
That does seem to be what Phil says, but in the scheme I have in my head after reading Phil’s proposal, things go a little differently. For the avoidance of doubt, I am claiming neither that Phil would want this nor that it’s the right thing to do.
Suppose A votes on something B wrote. They have some history: A has voted +1, 0, −1 on u, v, w of B’s things in the past, respectively. Here u+v+w is the total number of things B has ever written.
I think we probably want to ignore the ones A hasn’t voted on. So we care only about u and w.
What should our prediction be? One simple answer: we assign probabilities proportional to u+1,w+1 to votes +1,-1 on A’s next vote. (This is basically Laplace’s rule of succession, or equivalently it’s what we get if we suppose A’s votes are independently random with unknown fixed probabilities and start with a flat prior on those probabilities.)
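In code, this prediction is just pseudo-count smoothing (a sketch; the function name is mine):

```python
def predict_next_vote(u, w):
    """P(next vote is +1) and P(next vote is -1), after u upvotes and
    w downvotes, under a flat prior (Laplace's rule of succession)."""
    return (u + 1) / (u + w + 2), (w + 1) / (u + w + 2)

print(predict_next_vote(0, 0))  # (0.5, 0.5): no history, flat prior
print(predict_next_vote(3, 1))  # (2/3, 1/3)
```

Starting with a different prior would simply mean offsetting u and w by different pseudo-counts than 1 and 1.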
We might actually want to start with a different prior on the probabilities, which would mean offsetting u and w by different amounts.
Now along comes A’s vote, call it a, which is either +1 or −1. The score it produces is -a · log(Pr(A votes a | history)); that is, -log((u+1)/(u+w+2)) if A votes +1, and +log((w+1)/(u+w+2)) if A votes −1. This is added to the score for whatever it is B wrote, and to B’s overall total score.
With this scheme, an upvote always has a positive effect and a downvote always a negative one, but as you make the same vote over and over again it becomes less and less effective. For instance, suppose A upvotes everything B posts. Then A’s first upvote counts for -log(1/2); the next for -log(2/3); the next for -log(3/4); etc. The total effect of n upvotes (and nothing else) is to contribute log(n+1) to B’s score.
There are some things about this that feel a little unsatisfactory. I will mention three.

First: although “vote counts for plus or minus the number of bits of information conveyed” sounds pretty good, on reflection it feels not-quite-right. The situation is a bit like estimating the heads-probability of a biased coin, where what you do on each new result is almost, but not quite, to adjust by +/- the information you just got, and the aggregated result is somewhat different.

Second: the overall result of a sequence of votes, with this scheme, can depend quite a bit on the order in which they occur, and that doesn’t feel like what we want.

Third: the overall result’s dependence on individual votes can actually be “backwards”. If you vote +,-,-,+ you get -log(1/2)+log(1/3)+log(1/2)-log(2/5) = log(5/6), which is negative; but if you vote -,-,-,+ you get +log(1/2)+log(2/3)+log(3/4)-log(1/5) = log(5/4), which is positive!
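The third point is easy to check numerically. A sketch of the scheme described above (my own code):

```python
from math import log2

def total_score(votes):
    """Running-score scheme: each vote a adds -a * log2(Pr(a | history)),
    with Laplace-smoothed predictions from the counts (u up, w down) so far."""
    u = w = 0
    total = 0.0
    for a in votes:
        if a == +1:
            total += -log2((u + 1) / (u + w + 2))
            u += 1
        else:
            total += log2((w + 1) / (u + w + 2))
            w += 1
    return total

print(total_score([+1, -1, -1, +1]))  # log2(5/6) < 0, despite net-even votes
print(total_score([-1, -1, -1, +1]))  # log2(5/4) > 0, despite net-negative votes
```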
That seems highly undesirable. Maybe what Phil has in mind avoids these problems without incurring worse ones. The most obvious way to avoid them that I see, though, involves moving a little way away from the “effect of vote is bits” paradigm, as follows.
Implicit in those probability calculations is the model I mentioned above: A’s votes on B are independent Bernoulli with fixed but unknown probability p that each one is up rather than down, and we begin with a flat prior over p. Suppose we stick with that model, and ask what we know about p after some of A’s votes. The answer (famously) is that our posterior for p, after seeing u upvotes and w downvotes, is distributed as Beta(u+1, w+1), whose mean is (u+1)/(u+w+2). So our expectation for A’s next vote is (u-w)/(u+w+2). We could take A’s total contribution to B’s score to be exactly this, and do the obvious thing with scores for individual comments and posts: weight each vote by 1/(#votes+2), where #votes is the number of times the voter in question has voted on things by the poster in question.
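A sketch of this counts-only version (my code; by construction it cannot depend on vote order, since it sees only the counts):

```python
def contribution(u, w):
    """A's total contribution to B's score: the expectation of A's next
    vote under the Beta(u+1, w+1) posterior, i.e. (u - w) / (u + w + 2)."""
    p_up = (u + 1) / (u + w + 2)  # posterior mean of Beta(u+1, w+1)
    return p_up - (1 - p_up)

print(contribution(20, 20))  # 0.0: half-and-half cancels exactly
print(contribution(1, 0))    # about 0.333: a lone upvote counts for 1/3
print(contribution(100, 0))  # about 0.98: bounded below 1, however many votes
```

The last line shows the bounding behaviour discussed next: unanimous votes from one person can never contribute more than 1 in total.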
This suggests a broader family of schemes, where each vote is weighted by f(#votes) for some other decreasing function f. If you feel, as I think I do, that the overall effect of many votes by A on B shouldn’t actually be bounded by a small multiple of the effect of one vote, you might want f to decrease more slowly; perhaps take f(#votes) proportional to 1/sqrt(#votes), or something like that.
All of these revised schemes have the property that it’s always better to have more upvotes and fewer downvotes; it’s just that A’s influence on B’s score gets less as A’s votes on B get more numerous. And votes from different people just add. So if someone posts what everyone regards as dreck, all the downvotes they get will in fact hurt them.
(Possible downside: the advantage of using sockpuppets becomes much greater, and therefore presumably also the temptation to use them.)
That seems pretty reasonable to me.
[EDITED to add: except that what I was saying “seems pretty reasonable” was not in fact what Richard wrote; I misread. See comments below.]
Why should a posting by someone who everyone else agrees has never had anything useful to say be judged less bad than the same posting by someone who does on occasion post upworthy things?
Oh, I beg your pardon—I misread what you wrote as “… that thinks everyone posts …” rather than “… that everyone thinks posts …”, and answered accordingly.
Having now (I hope) read the words you actually wrote, my intuition agrees with yours, but I suspect that it may only be artificial extreme cases that produce such counterintuitive outcomes. I will think about it some more.