Proposal that doesn’t require changing the site to be useful:
Would it be possible to obtain a data dump of a matrix of how users have scored posts?
E.g. one row per user i (no id), one column per post j, and the i,j entry is +1, −1, or 0 according to how user i scored post j (up, down, or no score).
Substantially better would be for the i,j entry to take one of four values (+1, −1, 0, NA): up, down, no score but visited the post, no score because never visited the post.
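To make the two encodings concrete, here is a minimal sketch in Python. The event format (user row, post column, vote) and the function names are assumptions for illustration only, not anything the site is known to export. NA is represented as `None`, so "visited but did not score" (0) stays distinct from "never visited".

```python
# Hypothetical raw events: (user_row, post_col, vote), where vote is
# +1 (up), -1 (down), or 0 (visited the post but did not score it).
# A (user, post) pair that never appears means "never visited" (NA).
events = [(0, 0, 1), (0, 1, 0), (1, 1, -1), (2, 0, 1), (2, 2, 1)]


def build_matrix(events, n_users, n_posts):
    """Four-valued matrix: +1, -1, 0, or None (NA = never visited)."""
    matrix = [[None] * n_posts for _ in range(n_users)]
    for user, post, vote in events:
        if vote not in (1, -1, 0):
            raise ValueError(f"unexpected vote value: {vote!r}")
        matrix[user][post] = vote
    return matrix


def collapse_na(matrix):
    """Three-valued matrix: NA is merged into 0, losing the visit signal."""
    return [[0 if v is None else v for v in row] for row in matrix]


matrix = build_matrix(events, n_users=3, n_posts=3)
coarse = collapse_na(matrix)
```

The second function shows what the cheaper dump throws away: once NA is merged into 0, "read it and was unmoved" can no longer be told apart from "never saw it".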
Obvious objection: privacy concerns
Obvious response: prepare an algorithm, ship it to the Less Wrong server, run it there, and report back only summaries that somebody trustworthy vouches for as sufficiently clean.
Examples like this: the Netflix Prize
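One possible shape for "only report back summaries", as a rough sketch: the script runs next to the raw matrix and returns per-post vote counts, suppressing any post with too few voters. The threshold `min_voters` and the function name are assumptions made up for this example; whether such a rule counts as "sufficiently clean" is exactly what the trusted reviewer would have to judge.

```python
def post_summaries(matrix, min_voters=5):
    """Return {post_col: {"up": n, "down": m}} for posts with enough voters.

    matrix[i][j] is +1, -1, 0, or None as in the proposed dump.
    Posts with fewer than min_voters actual votes are left out entirely,
    so no reported row describes just a handful of users.
    """
    n_posts = len(matrix[0]) if matrix else 0
    summaries = {}
    for j in range(n_posts):
        votes = [row[j] for row in matrix if row[j] in (1, -1)]
        if len(votes) < min_voters:
            continue  # suppress small cells
        summaries[j] = {"up": votes.count(1), "down": votes.count(-1)}
    return summaries


# Toy data: three users, two posts; only post 0 has three real votes.
toy = [[1, 1], [1, None], [-1, 0]]
report = post_summaries(toy, min_voters=3)
```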
Objectives: (1) post recommendation, (2) curation of a list of “best posts” with more refinement than just “posts with highest score”; e.g. best posts for different “kinds” of users, or for some sort of high-karma consensus user.
Aside: we can do better than a Netflix analysis. They phrased the problem wrong:
predicting how you would score a post given (1) your scores on posts you have read and (2) that you have also read the post we are “predicting” for
better: predicting how you would score a post given (1) your scores on posts you read before some date and (2) that the post of interest is a “future” post that you may or may not actually read unless an intervention makes you read it (trickier and subtly different; it depends much more heavily on a prior / model or on better data)
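The difference between the two framings shows up in how the evaluation set is built. Below is a minimal sketch of the second framing, assuming hypothetical timestamped vote records and post publication dates (neither is known to be available in this form). History is everything before a cutoff; the targets are every (user, future post) pair, including pairs where the user never read the post, whose outcome is unknown (`None`).

```python
from datetime import date


def future_targets(votes, published, cutoff):
    """Split timestamped votes into pre-cutoff history and future targets.

    votes: list of (user, post, vote, when) tuples.
    published: dict mapping post -> publication date.
    Returns (history, targets) where targets maps every (user, future post)
    pair to the observed vote, or None if the user never scored that post.
    """
    history = {}
    for user, post, vote, when in votes:
        if when < cutoff:
            history.setdefault(user, {})[post] = vote
    future_posts = sorted(p for p, d in published.items() if d >= cutoff)
    observed = {(u, p): v for u, p, v, when in votes if when >= cutoff}
    targets = {
        (u, p): observed.get((u, p))
        for u in sorted(history)
        for p in future_posts
    }
    return history, targets


published = {"a": date(2010, 1, 1), "b": date(2010, 2, 1), "c": date(2010, 6, 1)}
votes = [
    ("u1", "a", 1, date(2010, 1, 5)),
    ("u1", "c", -1, date(2010, 6, 2)),
    ("u2", "b", 1, date(2010, 2, 3)),
]
history, targets = future_targets(votes, published, cutoff=date(2010, 3, 1))
```

A Netflix-style evaluation would keep only the targets with an observed vote. The `None` entries are the cases the second framing cares about, and they are why it leans so much harder on a prior or model.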
Observation: the scoring data isn’t public, but who commented on what effectively is.
Available alternatives: look at which posts the users you like wrote. Others?
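As a rough illustration of what the public comment data alone could support, here is a sketch that treats "commented on the same posts" as a similarity signal and suggests posts that similar users commented on. The user names, post ids, and the choice of Jaccard similarity are all assumptions for the example; commenting is only a weak proxy for liking a post.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, commented, top_n=3):
    """Rank posts the user has not commented on by summed neighbour similarity."""
    mine = commented[user]
    scores = {}
    for other, posts in commented.items():
        if other == user:
            continue
        sim = jaccard(mine, posts)
        if sim == 0:
            continue  # no overlap, no evidence of shared interests
        for post in posts - mine:
            scores[post] = scores.get(post, 0.0) + sim
    # Highest score first; ties broken by post id for a stable order.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]


# Hypothetical scrape: user -> set of posts they commented on.
commented = {
    "alice": {"p1", "p2", "p3"},
    "bob": {"p2", "p3"},
    "carol": {"p4"},
}
suggestions = recommend("bob", commented)
```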