Spoilers matter less than you think.
According to a single, unreplicated, counter-intuitive (and therefore more likely to make headlines) study.
Gah! Spoiler!
Those error bars look large enough that I could still be right about myself even without being a total freak.
Really? 11 of the 12 stories got rated higher when spoiled, which is decent evidence against the nil hypothesis (spoilers have zero effect on hedonic ratings) regardless of the error bars’ size. Under the nil hypothesis, each story has a 50⁄50 chance of being rated higher when spoiled, giving a probability of (¹²C₁₁ × 0.5¹¹ × 0.5¹) + (¹²C₁₂ × 0.5¹² × 0.5⁰) = 0.0032 that ≥11 stories get a higher rating when spoiled. So the nil hypothesis gets rejected with a p-value of 0.0063 (the probability’s doubled to make the test two-tailed), and presumably the results are still stronger evidence against a spoilers-are-bad hypothesis.
This, of course, doesn’t account for unseen confounders, inter-individual variation in hedonic spoiler effects, publication bias, or the sample (79% female and taken from “the psychology subject pool at the University of California, San Diego”) being unrepresentative of people in general. So you’re still not necessarily a total freak!
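For anyone who wants to double-check that arithmetic, here’s a minimal sketch in Python (assuming scipy is available; its `binomtest` is an exact binomial test):

```python
# Exact two-tailed sign test: 11 of 12 stories rated higher when spoiled,
# under the nil hypothesis that each story is a 50/50 coin flip.
from scipy.stats import binomtest

result = binomtest(k=11, n=12, p=0.5, alternative="two-sided")
print(result.pvalue)  # ~0.0063, i.e. double the one-sided tail of ~0.0032
```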
Yeah, given that study it doesn’t seem likely that works are, on average, liked less when spoiled; but what I meant is that there are probably certain individuals who like works less when spoiled. (Imagine Alice said something to the effect that she prefers chocolate ice cream to vanilla ice cream, and Bob said that it’s not actually the case that vanilla tastes worse than chocolate, citing a study in which, for 11 out of 12 ice cream brands, the vanilla ice cream is liked more on average than the chocolate ice cream, though in most cases the difference between the averages is not much bigger than each standard deviation. Even if the study was conducted among a demographic that does include Alice, that still wouldn’t necessarily mean Alice is mistaken, lying, or particularly unusual, would it?)
Just so. This is the sort of “inter-individual variation in hedonic spoiler effects” I had in mind earlier.
Edit: to elaborate a bit, it was the “error bars look large enough” bit of your earlier comment that triggered my sceptical “Really?” reaction. Apart from that bit I agree(d) with you!
Edit 2: aha, I probably did misunderstand you earlier. I originally interpreted your error bars comment as a comment on the statistical significance of the pairwise differences in bar length, but I guess you were actually ballparking the population standard deviation of spoiler effect from the sample size and the standard errors of the means.
Huh. For some reason I had read “inter-individual” there as “intra-individual”. Whatever happened to the “assume people are saying something reasonable” module in my brain?
Yep.
You can’t just ignore the error bars like that. In 8 of the 12 cases, the error bars overlap, which means there’s a decent chance that those comparisons could have gone either way, even assuming the sample mean is exactly correct. A spoilers-are-good hypothesis still has to bear the weight of this element of chance.
As a rough estimate: I’d say we can be sure that 4 stories are definitely better spoilered (>2 sd’s apart); of the ones 1–2 sd’s apart, maybe 3 are actually better spoilered; and the remainder could’ve gone either way. So we have maybe 9 out of 12 stories that are better with spoilers, which gives a probability of about 14.6% if we do the same two-tailed test on the same null hypothesis.
I don’t necessarily want you to trust the numbers above, because I basically eyeballed everything; still, they give an idea of why error bars matter.
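For concreteness, the binomial part of that figure is at least easy to check; a minimal sketch in Python, again assuming scipy is available:

```python
# Two-tailed sign test for the adjusted count: 9 "better spoiled" out of 12.
from scipy.stats import binomtest

print(binomtest(k=9, n=12, p=0.5, alternative="two-sided").pvalue)  # ~0.146
```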
Ignoring the error bars does throw away potentially useful information, and this does break the rules of Bayes Club. But this makes the test a conservative one (Wikipedia: “it has very general applicability but may lack the statistical power of other tests”), which just makes the rejection of the nil hypothesis all the more convincing.
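To illustrate that Wikipedia point, here’s a toy simulation sketch in Python; the normally distributed paired differences and the 0.8 effect size are my own made-up inputs, not anything from the study:

```python
# Toy power comparison: sign test vs. one-sample t-test on the same
# simulated paired differences (12 "stories", modest true effect).
import numpy as np
from scipy.stats import binomtest, ttest_1samp

rng = np.random.default_rng(0)
n_stories, n_trials, effect = 12, 2000, 0.8

sign_rejects = t_rejects = 0
for _ in range(n_trials):
    diffs = rng.normal(loc=effect, scale=1.0, size=n_stories)
    k = int((diffs > 0).sum())  # the sign test only looks at the signs
    if binomtest(k, n_stories, p=0.5, alternative="two-sided").pvalue < 0.05:
        sign_rejects += 1
    if ttest_1samp(diffs, 0.0).pvalue < 0.05:  # the t-test uses magnitudes too
        t_rejects += 1

print("sign test power:", sign_rejects / n_trials)  # noticeably lower...
print("t-test power:   ", t_rejects / n_trials)     # ...than this
```

So the sign test really does give up power by discarding the magnitudes, which is exactly why a rejection from it is the more convincing.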
If I’m interpreting this correctly, “the error bars overlap” means that the heights of two adjacent bars are within ≈2 standard errors of each other. In that case, overlapping error bars don’t necessarily indicate a decent chance that the comparisons could go either way; a 2-standard-error difference is quite a big one.
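To put a rough number on that, under two simplifying assumptions of mine (the bars show ±1 standard error, and the two means being compared are independent), even bars that just barely touch correspond to a fairly lopsided comparison:

```python
# Boundary case: two ±1 SE error bars that exactly touch, i.e. the means
# differ by 2 SE. Assumes independent means, which may not hold here.
from math import sqrt
from scipy.stats import norm

se = 1.0                       # standard error of each mean
gap = 2.0 * se                 # bars just touching
se_diff = sqrt(se**2 + se**2)  # SE of the difference of independent means
z = gap / se_diff              # ~1.41
print(2 * norm.sf(z))          # two-sided p ~0.157; one-sided tail ~0.078
```

Loosely speaking, even where the bars only just overlap, the one-sided chance of the sign being wrong is around 8%.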
But that rough estimate is an invalid application of the test. The sign test already allows for the possibility that each pairwise comparison can have the wrong sign; making your own adjustments to the numbers before feeding them into the test is an overcorrection. (Indeed, if “we can be sure that 4 stories are definitely better spoilered”, there’s no need to statistically test the nil hypothesis, because we already have definite evidence that it is false!)
This reminds me of a nice advantage of the sign test. One needn’t worry about squinting at error bars; it suffices to be able to see which of each pair of solid bars is longer!
Okay, if all you’re testing is that “there exist stories for which spoilers make reading more fun”, then yes, you’re done at that point. As far as I’m concerned, it’s obvious that such stories exist in both directions; the conclusion “spoilers are good” or “spoilers are bad” follows if one type of story dominates.
I don’t like the study setup there. One readthrough of spoiled vs one readthrough of unspoiled material lets you compare the participants’ hedonic ratings of dramatic irony vs mystery, and it’s quite reasonable that the former would be equally or more enjoyable… but unlike in the study, in real life unspoiled material can be read twice: the first time for the mystery, then the second time for the dramatic irony; with spoiled material you only get the latter.