The survey wasn’t timed, so maybe those more “into” the site simply put more time and effort into answering the questions; I don’t think much in the way of conclusions about bias can be drawn.
Looking, as before, at the number of missing answers (which seems like an awfully good proxy for how much time one puts into the survey), the people who were right on the first question answered about 1 more question on average, but that small difference doesn’t reach statistical significance:
R> lw <- read.csv("2012.csv")
R> # count NA or blank (" ") fields per respondent, as a proxy for survey effort
R> lw$MissingAnswers <- apply(lw, 1, function(x) sum(sapply(x, function(y) is.na(y) || as.character(y)==" ")))
R> right <- lw[as.character(lw$CFARQuestion1) == "Yes",]$MissingAnswers
R> wrong <- lw[as.character(lw$CFARQuestion1) == "no" | as.character(lw$CFARQuestion1) == "Cannot be determined",]$MissingAnswers
R> t.test(right, wrong)
Welch Two Sample t-test
data: right and wrong
t = -1.542, df = 942.5, p-value = 0.1234
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.3817 0.2858
sample estimates:
mean of x mean of y
16.69 17.74
(I’m not going to look at the other questions unless someone really wants me to, since the first question is the one that would benefit most from extended thought.)
Out of curiosity, I looked at what a more appropriate logistic regression would say (using this guide): given the number of missing/omitted survey answers (as a proxy for time investment), can one predict the answer given to the question? The numbers and method are a little different from the t-test’s, and the result is a little less statistically significant, but as before there’s no real relationship*:
R> lw <- read.csv("2012.csv")
R> lw$MissingAnswers <- apply(lw, 1, function(x) sum(sapply(x, function(y) is.na(y) || as.character(y)==" ")))
R> # drop respondents who left CFARQuestion1 blank or NA (see the note below)
R> lw <- lw[as.character(lw$CFARQuestion1) != " " & !is.na(as.character(lw$CFARQuestion1)),]
R> lw <- data.frame(lw$CFARQuestion1, lw$MissingAnswers)
R> # does the number of missing answers predict the answer given?
R> summary(glm(lw.CFARQuestion1 ~ lw.MissingAnswers, data = lw, family = "binomial"))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.17 -1.12 -1.05 1.23 1.41
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.00111 0.12214 0.01 0.99
lw.MissingAnswers -0.00900 0.00607 -1.48 0.14
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1366.6 on 989 degrees of freedom
Residual deviance: 1364.4 on 988 degrees of freedom
AIC: 1368
Number of Fisher Scoring iterations: 3
* a note to other analyzers: it’s really important to remove null answers/NAs, because they’ll show relationships all over the place. In this example, if you leave NAs in for the CFARQuestion1 field, you’ll wind up getting a very statistically significant relationship, because every CFARQuestion1 left as NA by definition increases MissingAnswers by 1! And people who didn’t answer that question probably didn’t answer a lot of other questions, so the NA respondents enable a very easy, reliable prediction of MissingAnswers…
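To make the confound concrete, here’s a minimal simulated sketch (variable names and parameters are hypothetical, not taken from the survey data): a latent effort variable drives both skipping the question and skipping everything else, so leaving the question NA “predicts” MissingAnswers essentially by construction:

R> # hypothetical simulation of the NA confound; not the actual survey data
R> set.seed(1)
R> n <- 1000
R> effort <- runif(n)                      # latent time investment per respondent
R> skipped.q1 <- rbinom(n, 1, 1 - effort)  # 1 = left the question NA
R> # low effort means more blanks elsewhere, plus the skipped question itself
R> missing.answers <- rbinom(n, 50, 1 - effort) + skipped.q1
R> summary(glm(skipped.q1 ~ missing.answers, family = "binomial"))$coefficients

The missing.answers coefficient comes out wildly significant, even though nothing is going on beyond the fact that skipping the question is itself one of the missing answers.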
How do you get this nice box for the code? What’s the magic command that you have to tell the Wiki?
Markdown’s code syntax is to indent each line by >=4 spaces; LW’s implementation is subtly broken in that it strips all the internal indentation, and another gotcha is that you can’t have any trailing whitespace, or lines will be combined in a way you probably don’t want.
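For example, a comment source like this (every line prefixed with 4 spaces, and no trailing whitespace) renders as a code box:

    R> x <- rnorm(100)
    R> mean(x)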
MediaWiki syntax is entirely different and partially depends on what extensions are enabled.
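For what it’s worth, on a stock MediaWiki install you get a preformatted box by starting each line with a single space or by wrapping the block in <pre> tags; actual syntax highlighting (e.g. <syntaxhighlight lang="r">) only works if the SyntaxHighlight extension is installed, hence the dependence on configuration:

<pre>
R> x <- rnorm(100)
R> mean(x)
</pre>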
That seems like a rather post hoc explanation.