Digging into the paper, I give them an A for effort—they used some interesting methodologies—but there’s a serious problem with it that destroys many of its conclusions. Here are the three different measures they used of a post’s quality:
q’: Quality as determined by blinded users given instructions on how to vote.
p: upvotes / (upvotes + downvotes)
q: Prediction for p, based on bigram frequencies of the post, trained on known p for half the dataset
q is the measure they used for most of their conclusions. Note that it is supposed to represent quality, but is based entirely on bigrams. This doesn’t pass the sniff test. Whatever q measures, it isn’t quality. At best it’s grammaticality. It is more likely a prediction of rating based on the user’s identity (individuals have identifiable bigram counts) or politics (“liberal media” and “death tax” vs. “pro choice” and “hate crime”).
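To make the objection concrete, here is a minimal sketch of what a bigram-based predictor of p looks like. This is not the paper’s code—the model choice, the scikit-learn-style pipeline, and the toy posts and p values are all invented for illustration—but it shows the general recipe: count bigrams, regress onto the observed vote ratio.

```python
# Hypothetical sketch of a bigram-based predictor of p (the vote ratio).
# Not the paper's code; names, model choice, and toy data are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Toy, made-up training examples: post text and its observed p.
posts = [
    "the liberal media will not report this",
    "pro choice groups responded to the ruling",
    "repealing the death tax helps small farms",
    "prosecutors charged it as a hate crime",
]
p = [0.35, 0.80, 0.40, 0.75]

model = make_pipeline(
    CountVectorizer(ngram_range=(2, 2)),  # bigram counts are the only features
    Ridge(),                              # linear regression onto the vote ratio
)
model.fit(posts, p)

# q for a new post is just a weighted sum of its bigram counts.
q_new = model.predict(["the liberal media buried the story"])
```

Nothing in a pipeline like that ever sees who wrote the post or whether its argument holds up; all it can reward is phrasing, which is exactly why the identity and politics confounds are plausible.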
q is a prediction for p. p is a proxy for q’. There is no direct connection between q’ and q—no reason to think they will have any correlation not mediated by p.
R-squared values:
q to p: 0.04 (unless “mean R = 0.22” in the paper is a typo and should actually say “mean R^2 = 0.22”)
q to q’: 0.25
q’ to p: 0.12
First, the R-squared between q’ (quality scores from the judges) and p (the community rating) is 0.12. That’s crap. It means that votes are almost unrelated to post quality.
Next, the strongest R-squared is between q and q’, but the maximum the causal path through p could produce is 0.04 * 0.12 = 0.0048, because there is no causal connection between them except p.
That means that q, the machine-learned prediction they use for their study, has an acausal R-squared with q’ (post quality) roughly 50 times larger than the maximum the causal path can explain.
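If you want to check that arithmetic, here is the back-of-the-envelope version, using the R-squared values quoted above (the variable names are mine, not the paper’s notation):

```python
# Back-of-the-envelope check of the "50 times" claim, using the R^2 values above.
r2_q_p      = 0.04  # q (bigram prediction) vs. p (vote ratio)
r2_qprime_p = 0.12  # q' (judged quality) vs. p
r2_q_qprime = 0.25  # q vs. q' -- the reported, suspiciously strong value

# If the only link between q and q' is p, then corr(q, q') is at most
# corr(q, p) * corr(p, q'), so the mediated R^2 is at most the product
# of the two R^2 values.
max_mediated_r2 = r2_q_p * r2_qprime_p
print(max_mediated_r2)                # ~0.0048
print(r2_q_qprime / max_mediated_r2)  # ~52, i.e. the "roughly 50 times" figure
```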
In other words, all their numbers are bullshit. They aren’t produced by post quality, nor by user voting patterns. There is something wrong with how they’ve processed their data that has produced an artifactual correlation.