I think these plots would be much improved by adding error bars. In particular, I suspect that the number of short posts is greater than the number of long posts, and so the average-karma estimates for long posts are more uncertain.
Also, did you bucketize the word counts? What do specific points on your plots correspond to?
Each point on the graph corresponds to an average of several hundred (about two thousand for the middle graph) data points. The number of short posts is indeed greater than the number of long posts, so the horizontal distance between the points on the graph increases with increasing number of characters.
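For concreteness, here is a minimal sketch of that kind of bucketing with error bars added. It assumes the buckets hold equal numbers of posts (which the description above suggests but does not state), and the data is synthetic and made up; `equal_count_buckets` is a hypothetical helper, not the code behind the plots.

```python
import math
import random
import statistics

def equal_count_buckets(points, bucket_size):
    """Sort (x, y) pairs by x and split them into consecutive buckets.

    Returns one (mean_x, mean_y, standard_error_of_mean_y, n) tuple per bucket.
    """
    ordered = sorted(points)
    summary = []
    for start in range(0, len(ordered), bucket_size):
        chunk = ordered[start:start + bucket_size]
        if len(chunk) < 2:
            continue  # a single point gives no estimate of spread
        xs = [x for x, _ in chunk]
        ys = [y for _, y in chunk]
        sem = statistics.stdev(ys) / math.sqrt(len(ys))
        summary.append((statistics.mean(xs), statistics.mean(ys), sem, len(ys)))
    return summary

# Synthetic data (an assumption): lengths skewed towards short posts, noisy karma.
rng = random.Random(0)
posts = [(rng.expovariate(1 / 2000), rng.gauss(10, 15)) for _ in range(5000)]

buckets = equal_count_buckets(posts, bucket_size=500)
for mean_x, mean_y, sem, n in buckets:
    # 1.96 * sem is an approximate 95% interval for the bucket mean.
    print(f"n={n:4d}  mean length={mean_x:8.0f}  mean karma={mean_y:6.2f} +/- {1.96 * sem:.2f}")
```

With equal-count buckets every mean rests on the same number of posts, so the error bars differ between short and long posts only if the spread of karma differs. The scarcity of long posts shows up as wider horizontal gaps instead.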
Any particular reason you did the plot this way instead of having a cloud of points and drawing some kind of regression line or curve through it? You are unnecessarily losing information by aggregating into buckets.
True, but it is virtually impossible to see a meaningful pattern when you have thousands of data points on the graph and R² < 0.2.
I disagree. I find point clouds useful, as long as they are not pure black. Kernel density plots are better, though.
But Lumifer gave you a concrete suggestion: plot a regression curve, not a bunch of buckets. Bucketing and drawing lines between points are kinds of smoothing, so you should instead use a good smoother. Say, loess. Just use ggplot and trust its defaults (which won't be loess with this many points).
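To illustrate what smoothing the raw points directly looks like, here is a rough sketch in plain Python. It uses Nadaraya-Watson kernel smoothing with a Gaussian kernel, which is a simpler relative of loess and not what ggplot does by default. The data, the bandwidth, and the `kernel_smooth` helper are all assumptions made up for the example.

```python
import math
import random

def kernel_smooth(xs, ys, grid, bandwidth):
    """Nadaraya-Watson estimate: Gaussian-weighted average of ys near each grid point."""
    smoothed = []
    for g in grid:
        weights = [math.exp(-0.5 * ((x - g) / bandwidth) ** 2) for x in xs]
        total = sum(weights)
        smoothed.append(sum(w * y for w, y in zip(weights, ys)) / total)
    return smoothed

# Synthetic data (an assumption): a weak linear trend buried in noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(2000)]
ys = [0.5 * x + rng.gauss(0, 3) for x in xs]

grid = [1, 3, 5, 7, 9]
curve = kernel_smooth(xs, ys, grid, bandwidth=1.0)
for g, value in zip(grid, curve):
    print(f"x={g}  smoothed y={value:.2f}")
```

Every raw point contributes to the curve, weighted by how close it is to the place being estimated, so nothing is thrown away at bucket boundaries. The bandwidth plays the role that bucket size plays in the averaged plots: larger values give a smoother but less detailed curve.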
Well, one question is: if it’s “impossible to see a meaningful pattern”, should you melt-and-recast the data so that the pattern appears? X-/
Another observation is that you are constrained by Excel. R can deal with such problems easily—do you have the raw dataset available somewhere?