You can’t just ignore the error bars like that. In 8 of the 12 cases, the error bars overlap, which means there’s a decent chance that those comparisons could have gone either way, even assuming the sample mean is exactly correct. A spoilers-are-good hypothesis still has to bear the weight of this element of chance.
As a rough estimate: I’d say we can be sure that 4 stories are definitely better spoilered (>2 SDs apart); of the ones 1–2 SDs apart, maybe 3 are actually better spoilered; and the remainder could’ve gone either way. So we have maybe 9 out of 12 stories that are better with spoilers, which gives a probability of 14.5% if we do the same two-tailed test on the same null hypothesis.
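For the curious, that figure is easy to check. A minimal sketch, assuming the two-tailed test in question is the exact binomial sign test against a fair-coin null (the helper name is mine):

```python
from math import comb

def sign_test_two_sided(successes: int, n: int) -> float:
    """Exact two-sided binomial (sign) test against p = 0.5.
    The null is symmetric, so the two-sided p-value is twice
    the probability of the more extreme tail, capped at 1."""
    k = max(successes, n - successes)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# 9 of 12 stories better with spoilers:
print(sign_test_two_sided(9, 12))  # ≈ 0.146
```

The exact value, 598/4096 ≈ 14.6%, matches the eyeballed 14.5% up to rounding.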
I don’t necessarily want you to trust the numbers above, because I basically eyeballed everything; however, it gives an idea of why error bars matter.
Ignoring the error bars does throw away potentially useful information, and this does break the rules of Bayes Club. But this makes the test a conservative one (Wikipedia: “it has very general applicability but may lack the statistical power of other tests”), which just makes the rejection of the nil hypothesis all the more convincing.
> In 8 of the 12 cases, the error bars overlap, which means there’s a decent chance that those comparisons could have gone either way, even assuming the sample mean is exactly correct. A spoilers-are-good hypothesis still has to bear the weight of this element of chance.
If I’m interpreting this correctly, “the error bars overlap” means that the heights of two adjacent bars are within ≈2 standard errors of each other. In that case, overlapping error bars don’t necessarily indicate a decent chance that the comparisons could go either way; a difference of 2 standard errors is quite a big one.
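To put a rough number on “quite a big one” (a sketch under assumed normal sampling error and equal standard errors for the two means; the helper is hypothetical, not from the original analysis):

```python
from math import erf, sqrt

def p_sign_flips(diff_in_se: float) -> float:
    """Two sample means are each measured with standard error 1, so
    their difference has standard error sqrt(2).  If the true
    difference is `diff_in_se`, return the chance that a fresh pair
    of measurements would show the opposite sign (normal model)."""
    z = diff_in_se / sqrt(2)             # standardize the difference
    return 0.5 * (1 - erf(z / sqrt(2)))  # normal upper tail via erf

print(round(p_sign_flips(2.0), 3))  # ≈ 0.079
```

So even a comparison whose ±1-standard-error bars just touch (a 2-standard-error gap) would flip sign on replication only about 8% of the time under this model.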
> As a rough estimate: I’d say we can be sure that 4 stories are definitely better spoilered (>2 SDs apart); of the ones 1–2 SDs apart, maybe 3 are actually better spoilered; and the remainder could’ve gone either way. So we have maybe 9 out of 12 stories that are better with spoilers, which gives a probability of 14.5% if we do the same two-tailed test on the same null hypothesis.
But this is an invalid application of the test. The sign test already allows for the possibility that each pairwise comparison can have the wrong sign. Making your own adjustments to the numbers before feeding them into the test is an overcorrection. (Indeed, if “we can be sure that 4 stories are definitely better spoilered”, there’s no need to statistically test the nil hypothesis because we already have definite evidence that it is false!)
> I don’t necessarily want you to trust the numbers above, because I basically eyeballed everything; however, it gives an idea of why error bars matter.
This reminds me of a nice advantage of the sign test. One needn’t worry about squinting at error bars; it suffices to be able to see which of each pair of solid bars is longer!
> Indeed, if “we can be sure that 4 stories are definitely better spoilered”, there’s no need to statistically test the nil hypothesis because we already have definite evidence that it is false!
Okay, if all you’re testing is that “there exist stories for which spoilers make reading more fun”, then yes, you’re done at that point. As far as I’m concerned, it’s obvious that such stories exist in either direction; the conclusion “spoilers are good” or “spoilers are bad” follows if one type of story dominates.