First, let me point out that the “behavioral changes” the authors describe were measured over only the three posts following each positive/negative evaluation, so it is unclear whether these effects persist over the long term.
Second, I find questionable the authors’ conclusion that negative evaluations cause the subsequent decline in post quality and increase in post frequency, since they did not control the assignment of the positive/negative evaluations. They model the positive/negative evaluations as random acts of chance (which is what we want for an RCT) and justify this by reporting that their bigram classifier assigns no difference in quality between the positively- and negatively-evaluated posts (across two posts by a pair of matched subjects). However, I find it likely that their classifier makes enough misclassifications to call this justification, and hence their conclusion, into question.
For instance, if bad posts tend to occur in streaks of frequent posts (as is the case in flame wars), then we can explain their observations without assigning any causal potency to negative evaluations. Once in a while the classifier will erroneously assign a high quality to a bad post near the start of a flame war, but on average it will correctly assign low qualities to the subsequent three posts by the same poster in that flame war; thus we would see exactly the effects the authors describe, without any causal contribution from the negative evaluation that other users gave to the post near the start of the flame war. (A toy simulation of this confound is sketched below.) To test this explanation, the authors could ask the CrowdFlower workers (p. 4) to label each b_0 (described on p. 5) and check whether their classifier is indeed misclassifying b_0 by assigning it too high a quality.
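To make the alternative explanation concrete, here is a minimal, self-contained toy simulation. It is my own construction, not the authors’ pipeline; every function, threshold, and parameter value below is an invented assumption for illustration only. Posters occasionally enter flame-war streaks of frequent bad posts, the community votes on true quality, a noisy score stands in for the bigram classifier, and keeping only b_0’s that the classifier scores highly plays the role of the matching step.

```python
# Toy model of the confound: bad posts cluster in streaks, the classifier is
# noisy, and votes have NO causal effect, yet post-downvote posts look worse
# and more frequent. All numbers are made up for illustration.
import random

random.seed(0)

def simulate_poster(n_posts=200, p_start_streak=0.05, streak_len=5):
    """Generate (true_quality, hours_to_next_post) pairs with flame-war streaks."""
    posts, in_streak = [], 0
    for _ in range(n_posts):
        if in_streak == 0 and random.random() < p_start_streak:
            in_streak = streak_len              # enter a flame war
        if in_streak > 0:
            true_q, gap = 0.2, 0.5              # bad, rapid posts
            in_streak -= 1
        else:
            true_q, gap = 0.8, 5.0              # normal posts
        posts.append((true_q, gap))
    return posts

def classify(true_q, noise=0.3):
    """Noisy stand-in for the bigram quality classifier."""
    return true_q + random.gauss(0, noise)

def vote(true_q):
    """The community downvotes genuinely bad posts, upvotes good ones."""
    return -1 if true_q < 0.5 else +1

neg_q, pos_q, neg_gap, pos_gap = [], [], [], []
for _ in range(2000):
    posts = simulate_poster()
    scores = [classify(q) for q, _ in posts]
    for i in range(len(posts) - 3):
        # Crude stand-in for the matching step: keep only b_0's that *look*
        # high quality to the classifier, so up- and downvoted posts appear
        # comparable, even though downvoted ones are often hidden flame-war posts.
        if scores[i] < 0.6:
            continue
        nxt_scores = [scores[i + k] for k in (1, 2, 3)]
        nxt_gaps = [posts[i + k][1] for k in (1, 2, 3)]
        if vote(posts[i][0]) < 0:
            neg_q += nxt_scores; neg_gap += nxt_gaps
        else:
            pos_q += nxt_scores; pos_gap += nxt_gaps

avg = lambda xs: sum(xs) / len(xs)
print("next 3 posts after a downvoted b_0: quality %.2f, hours to next post %.1f"
      % (avg(neg_q), avg(neg_gap)))
print("next 3 posts after an upvoted  b_0: quality %.2f, hours to next post %.1f"
      % (avg(pos_q), avg(pos_gap)))
```

Under these assumptions the posts following a downvoted b_0 come out lower in (classifier-measured) quality and more closely spaced than those following an upvoted one, even though the votes change nothing about the simulated posters’ behavior; the pattern is driven entirely by the streaks plus classifier noise.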
Since the authors did not conduct an RCT, we can come up with many alternative explanations like the one above, and I find them plausible. (Is it feasible to conduct an RCT on a site featuring upvotes and downvotes? Yes, it’s been done before.)
Despite my criticisms, I think the paper is not bad. I just don’t think the authors’ methods provide sufficient evidence to warrant their seemingly strong confidence in their conclusions.
> Second, I find questionable the authors’ conclusion that negative evaluations cause the subsequent decline in post quality and increase in post frequency, since they did not control the assignment of the positive/negative evaluations. They model the positive/negative evaluations as random acts of chance
If a community really votes as random acts of chance, that would explain why the voting doesn’t lead to good behavior ;)