Lumifer comments on Should you write longer comments? (Statistical analysis of the relationship between comment length and ratings)

Lumifer 20 Jul 2015 16:05 UTC
8 points
I think these plots would be much improved by adding error bars. In particular, I suspect that the number of short posts is greater than the number of long posts and so the average-karma estimates for long posts are more uncertain.

Also, did you bucketize the word counts? What do specific points on your plots correspond to?
- cleonid 20 Jul 2015 22:32 UTC
  2 points
  Each point on the graph corresponds to an average of several hundred (about two thousand for the middle graph) data points. The number of short posts is indeed greater than the number of long posts, so the horizontal distance between the points on the graph increases with increasing number of characters.
  - Lumifer 21 Jul 2015 1:30 UTC
    5 points
    Any particular reason you did the plot this way instead of having a cloud of points and drawing some kind of regression line or curve through it? You are unnecessarily losing information by aggregating into buckets.
    - cleonid 21 Jul 2015 11:43 UTC
      0 points
      True, but it is virtually impossible to see a meaningful pattern when you have thousands of data points on the graph and R² < 0.2.
      - Douglas_Knight 22 Jul 2015 5:28 UTC
        0 points
        I disagree. I find point clouds useful, as long as they are not pure black. Kernel density plots are better, though.
        
        But Lumifer gave you a concrete suggestion: plot a regression curve, not a bunch of buckets. Bucketing and drawing lines between points are kinds of smoothing, so you should instead use a good smoother. Say, loess. Just use ggplot and trust its defaults (which will not be loess with this many points).
      - Lumifer 21 Jul 2015 16:30 UTC
        0 points
        Well, one question is whether, if it’s “impossible to see a meaningful pattern”, you should melt-and-recast the data so that the pattern appears X-/
        
        Another observation is that you are constrained by Excel. R can deal with such problems easily—do you have the raw dataset available somewhere?
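
For readers who want to try the error-bar suggestion from the top of the thread, here is a minimal sketch in Python. It assumes the plots were built from equal-count buckets (which is what "several hundred data points per point" suggests, though the thread does not confirm the exact method). The function name `bucket_means`, the bucket count, and the synthetic data are all made up for illustration; the raw dataset is not available here.

```python
import math
import random
import statistics


def bucket_means(lengths, scores, n_buckets):
    """Sort comments by length, split them into equal-count buckets, and
    return (mean length, mean score, standard error of mean score) per bucket."""
    pairs = sorted(zip(lengths, scores))
    n = len(pairs)
    summary = []
    for b in range(n_buckets):
        chunk = pairs[b * n // n_buckets:(b + 1) * n // n_buckets]
        if len(chunk) < 2:
            continue  # a standard error needs at least two observations
        xs = [length for length, _ in chunk]
        ys = [score for _, score in chunk]
        std_err = statistics.stdev(ys) / math.sqrt(len(ys))
        summary.append((statistics.fmean(xs), statistics.fmean(ys), std_err))
    return summary


# Synthetic stand-in data (NOT the real comment data): many short comments,
# few long ones, and a weak noisy relationship between length and score.
rng = random.Random(0)
lengths = [rng.expovariate(1 / 300) for _ in range(5000)]
scores = [0.002 * x + rng.gauss(0, 3) for x in lengths]

summary = bucket_means(lengths, scores, 10)
for mean_len, mean_score, std_err in summary:
    print(f"{mean_len:8.1f}  {mean_score:6.2f} +/- {std_err:.2f}")
```

Each row gives one plotted point plus the half-width of a one-standard-error bar. With equal-count buckets the buckets all hold about the same number of comments, so the sparseness of long comments shows up mainly as wider horizontal spacing; fixed-width length bins would instead put the extra uncertainty into taller error bars on the right.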