(I’m a member of the LW team, but this is an area where we still have a lot of uncertainty, so we don’t necessarily agree internally and our thinking is likely to change.)
There are three proposed changes being bundled together here: (1) the guidance given about how to vote; (2) the granularity of the votes elicited; and (3) how votes are aggregated and presented to readers.
As you correctly observe, votes serve multiple purposes: they give other readers information about what’s worth their time to read, they give readers information about what other people are reading, and they give authors feedback about whether they did a good job. Sometimes these come apart; for example, if someone helpfully clears up a confusion that only one person had, then their comment should receive positive feedback, but isn’t worth reading for most people.
These things are, in practice, pretty tightly correlated, especially when judged by voters who are only spending a little bit of time on each vote. And that seems like the root issue: disentangling “how I feel about this post” from “is this post worth reading” requires more time and distance than is currently going into voting. One idea I’m considering is retrospective voting: periodically show people a list of things they’ve read in the past (say, the past week), and ask them to rate those things then. This would be less noisy, because it elicits comparisons rather than ups/downs in isolation, and it might also change people’s votes in a good way by giving them some distance.
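To make that concrete, here’s a rough sketch of how the retrospective batch might be assembled; the names and event shapes below are assumptions for illustration, not anything that exists in our codebase:

```python
# Rough sketch of retrospective voting (hypothetical names and data shapes):
# once a week, gather the items a user actually read, and present them as a
# batch to be rated together, with some distance from the original reading.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ReadEvent:
    item_id: str
    user_id: str
    read_at: datetime

def items_to_rerate(read_events: list[ReadEvent], user_id: str,
                    now: datetime, window_days: int = 7) -> list[str]:
    """Items this user read in the last week, oldest first, deduplicated,
    so they can be rated side by side rather than in isolation."""
    cutoff = now - timedelta(days=window_days)
    recent = sorted(
        (e for e in read_events if e.user_id == user_id and e.read_at >= cutoff),
        key=lambda e: e.read_at,
    )
    seen: set[str] = set()
    batch: list[str] = []
    for e in recent:
        if e.item_id not in seen:   # a user may reopen the same item
            seen.add(e.item_id)
            batch.append(e.item_id)
    return batch
```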
Switching from the current up/down/super-up/super-down votes to 0-100% range voting seems like its main effect would be to create a distinction between implicit and explicit neutral votes. That is, currently if people feel something is meh, they don’t vote, but in the proposed system they would instead give it a middling score. The advantage of this is that you can aggregate scores in a way that measures quality without conflating it with attention; right now, a post/comment that has been read more times gets more votes, and we don’t have a good way of distinguishing it from a post/comment with fewer reads but more votes per reader.
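As a toy illustration of why that matters for aggregation (the numbers and function names are purely illustrative): averaging explicit 0-100% scores per reader separates quality from reach, whereas summing up/down votes does not.

```python
# Toy comparison (illustrative only): summed up/down votes conflate quality
# with attention, while averaging explicit 0-100 scores over readers does not.

def summed_votes(votes: list[int]) -> int:
    """Current-style aggregation: each vote is +1/-1 (or a strong-vote weight);
    more readers generally means a bigger total."""
    return sum(votes)

def mean_score(scores: list[float]) -> float | None:
    """Range-voting aggregation: average the 0-100 scores per reader, so the
    result reflects quality per reader rather than read count."""
    return sum(scores) / len(scores) if scores else None

# A widely-read, mildly-liked post vs. a niche, well-liked comment:
popular = [60.0] * 200      # 200 readers, all rate it 60/100
niche   = [90.0] * 10       # 10 readers, all rate it 90/100

assert mean_score(popular) == 60.0
assert mean_score(niche) == 90.0                          # mean favors the niche comment
assert summed_votes([1] * 200) > summed_votes([1] * 10)   # raw totals favor reach
```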
But I’m skeptical that people will actually cast explicit neutral votes in most cases; that would require them to break out of skimming, slow down, and make many more explicit decisions than they currently do. A more promising direction might be to collect more granular data on scroll positions and timings, so that we can estimate the number of people who read or skimmed a comment without voting, and use that as an input into scoring.
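For instance, here’s a rough sketch of what that could look like, assuming we logged per-comment scroll and dwell events; the event schema and thresholds below are made up for illustration:

```python
# Hypothetical sketch: classify each (user, comment) exposure as "read",
# "skimmed", or "missed" from scroll position and dwell time, then count
# non-voting readers so they can feed into scoring as implicit neutrals.
from dataclasses import dataclass

@dataclass
class Exposure:
    user_id: str
    comment_id: str
    visible_fraction: float   # how much of the comment entered the viewport (0-1)
    dwell_seconds: float      # how long the comment stayed on screen
    voted: bool

def classify(e: Exposure) -> str:
    # Thresholds are made up; in practice they'd be tuned against labeled data.
    if e.visible_fraction >= 0.9 and e.dwell_seconds >= 5.0:
        return "read"
    if e.visible_fraction >= 0.5 and e.dwell_seconds >= 1.0:
        return "skimmed"
    return "missed"

def silent_reader_count(exposures: list[Exposure], comment_id: str) -> int:
    """Estimated number of people who read the comment but never voted."""
    return sum(1 for e in exposures
               if e.comment_id == comment_id
               and not e.voted
               and classify(e) == "read")
```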
The third thing is aggregation—how we convert a set of votes into a sort-order to guide readers to the best stuff—which is the aspect of the current system I’m least satisfied with. That includes things like karma-weighting of votes, and also the handling of polarizing posts. In the long term, I’m hoping to generate a dataset of pairwise comparisons by trusted users, which we can use as a ground truth to test algorithms against. But polarizing posts will always be difficult to score, because the votes reflect an underlying disagreement between humans, and whether a post should be shown may depend on things the voters haven’t evaluated, like the truth of the post’s claims.
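As a sketch of how such a pairwise dataset could be used (the function names and data shapes here are assumptions, not an existing API), a candidate scoring rule could be graded by how often it agrees with trusted users’ comparisons:

```python
# Hypothetical evaluation harness: given pairwise judgments from trusted users
# ("post A is better than post B"), measure how often a candidate scoring rule
# orders the pair the same way. Names and data are illustrative only.
from collections.abc import Callable

Judgment = tuple[str, str]   # (better_post_id, worse_post_id)

def agreement_rate(score_fn: Callable[[str], float],
                   judgments: list[Judgment]) -> float:
    """Fraction of trusted-user pairwise comparisons the scoring rule agrees
    with (ties count as disagreement)."""
    if not judgments:
        return 0.0
    agree = sum(1 for better, worse in judgments
                if score_fn(better) > score_fn(worse))
    return agree / len(judgments)

# Example: grade a toy karma-based rule against three ground-truth comparisons.
karma = {"a": 120.0, "b": 80.0, "c": 15.0}
judgments = [("a", "b"), ("a", "c"), ("c", "b")]   # trusted users prefer c over b

print(agreement_rate(lambda p: karma[p], judgments))   # 2/3: karma misorders (c, b)
```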
That silent middle is very much a problem in collecting NPS data in its original context, too: you get lots of data from upset customers and happy customers, while meh customers stay silent. You can do some interpolation about what missing votes mean, and coupled with scrolling behavior you could get some sense of read count that you could use to make adjustments, but that obviously makes things a bit more complicated.
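One illustrative way the interpolation could work (the neutral prior and the silent-reader weight below are made-up knobs, not measured values) is to impute weak neutral scores for the readers we estimate read an item without voting:

```python
# Hypothetical adjustment: treat estimated non-voting readers as weak neutral
# votes when aggregating explicit 0-100 scores. The 50.0 prior and the 0.25
# weight are illustrative knobs, not measured values.
def adjusted_score(explicit_scores: list[float],
                   estimated_silent_readers: int,
                   neutral_prior: float = 50.0,
                   silent_weight: float = 0.25) -> float | None:
    """Weighted mean of explicit scores plus imputed neutrals for readers who
    (per scroll data) read the item but never voted."""
    explicit_weight = float(len(explicit_scores))
    imputed_weight = estimated_silent_readers * silent_weight
    total_weight = explicit_weight + imputed_weight
    if total_weight == 0:
        return None
    total = sum(explicit_scores) + neutral_prior * imputed_weight
    return total / total_weight

# A comment with a few enthusiastic voters but many silent readers gets pulled
# toward the middle; one where most readers voted keeps its face-value score.
print(adjusted_score([95.0, 90.0], estimated_silent_readers=40))   # ≈ 57.1
print(adjusted_score([95.0, 90.0], estimated_silent_readers=2))    # ≈ 84.0
```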