This is going to be yet another horrible post. I just go meta and personal. Sorry.
I don’t understand how this thread (and a few others like it) on stats can happen; in particular, your second point (re: the basic mistake). It is the single solitary thing any person who knows any stats at all knows. Am I wrong? Maybe ‘knows’ meaning ‘understands’. I seem to recall the same error made by Gwern (and pointed out). I mean, the system works in the sense that these comments get upvoted, but it is like... people having strong technical opinions, held with very high confidence, about Shakespeare without being able to write out a sentence. It is not inconceivable the opinions are good (stroke, language, etc.), but it says something very odd about the community that it happens regularly and is not more widely noticed. My impression is that Less Wrong is insane on statistics in particular, and on some areas of physics (and on the social aspects of science and philosophy).
I didn’t read the original post, the paper, or anything other than some comment by Goetz which seemed to show he didn’t know what a p-value was and had a gigantic mouth. It’s possible I’ve missed something basic. Normally, before concluding there is madness in the world, I’d be careful. For me to be right here, madness would have to be very, very likely (e.g., if I correctly guess it’s −70 outside without checking any data, that says I already know something unusual about where I live).
It is the single solitary thing any person who knows any stats at all knows.
Many people with statistics degrees, including working statisticians and statistics professors, make the p-value fallacy; so perhaps your standards are too high, if LWers merely being as good as statistics professors comes as a disappointment to you.
I seem to recall the same error made by Gwern (and pointed out).
I’ve pointed out the misinterpretation of p-values many times (most recently, one by Yvain), and wrote a post making the commonness of the misinterpretation a major point (http://lesswrong.com/lw/g13/against_nhst/), so I would be a little surprised if I had made that error.
Sorry, Gwern, I may be slandering you, but I thought I noticed it long before that (I’ve been reading, despite my silence). Another thing I have accused you of, in my head, is a failure to appropriately apply a multiple test correction when doing some data exploration for trends in the Less Wrong survey. Again, I may have misidentified you. Such behavior would be striking, if true, since failing to correct for multiple tests seems to me one of the most basic complaints Less Wrong has about science (somewhat incorrectly).
Edited: Gwern is right (about my misremembering). Either I was skimming and didn’t notice Gwern was quoting, or I just mixed up corrector with corrected. Sorry about that. In possible recompense: what I would recommend you do for data exploration is decide ahead of time if you have some particularly interesting hypothesis or not. If not, and you’re just going to check lots of stuff, then commit to that and the appropriate multiple test correction at the end. That level of correction then also saves your ‘noticing’ something interesting and checking it specifically from being circular (because you were already checking ‘everything’ and correcting appropriately).
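To make that concrete, here is a minimal sketch of the ‘correct at the end’ workflow in Python, using statsmodels; the p-values are hypothetical placeholders, not numbers from any actual analysis:

    from statsmodels.stats.multitest import multipletests

    # Hypothetical raw p-values collected over an exploratory batch of tests.
    raw_pvals = [0.003, 0.040, 0.012, 0.200, 0.650, 0.048]

    # One correction applied at the end, over the whole batch. Holm's method
    # controls the family-wise error rate and is uniformly more powerful than
    # plain Bonferroni.
    reject, adjusted, _, _ = multipletests(raw_pvals, alpha=0.05, method='holm')

    for p, p_adj, sig in zip(raw_pvals, adjusted, reject):
        print(f"raw p = {p:.3f}  adjusted p = {p_adj:.3f}  significant: {sig}")

The key discipline is committing to the size of the batch before looking: the correction is over every test you ran, not just the ones that looked promising.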
Another thing I have accused you of, in my head, is a failure to appropriately apply a multiple test correction when doing some data exploration for trends in the Less Wrong survey.
It’s true I didn’t do any multiple correction for the 2012 survey, but I think you’re simply not understanding the point of multiple correction.
First, ‘data exploration’ is precisely when you don’t want to do multiple correction, because when data exploration is being done properly, it’s being done as exploration: to guide future work, to discern what signals may be there for follow-up. But multiple correction controls the false positive rate at the expense of producing tons of false negatives; this is not a trade-off we want to make in exploration. If you look at the comments, dozens of different scenarios and ideas are being examined, so we know in advance that any multiple correction is going to trash pretty much every single result, and we won’t wind up with any interesting hypotheses at all! Predictably defeating the entire purpose of looking. Why would you do this wittingly? It’s one thing to explore data and find no interesting relationships at all (shit happens), but it’s another thing entirely to set up procedures which nearly guarantee that you’ll ignore any relationships you do find. And which multiple correction, anyway? I didn’t come up with a list of hypotheses and then methodically go through them; I tested things as people suggested them or as I thought of them. Should I have done a single multiple correction of them all yesterday? (But what if I think of a new hypothesis tomorrow...?)
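A toy simulation makes the point; all the counts and the effect size below are invented assumptions, not anything from the survey. With many weak true effects mixed into noise, a family-wise correction buries most of the real signals along with the false positives:

    import numpy as np
    from scipy import stats
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(0)
    n_per_group, n_null, n_real = 100, 50, 10   # assumed: 50 dud ideas, 10 weak real effects

    pvals, is_real = [], []
    for i in range(n_null + n_real):
        real = i >= n_null
        shift = 0.3 if real else 0.0            # assumed (weak) true effect size
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(shift, 1.0, n_per_group)
        pvals.append(stats.ttest_ind(a, b).pvalue)
        is_real.append(real)

    pvals, is_real = np.array(pvals), np.array(is_real)
    uncorrected = pvals < 0.05
    corrected = multipletests(pvals, alpha=0.05, method='holm')[0]

    # Exploration wants the true effects to surface; correction buries most of them.
    print("true effects flagged, uncorrected:   ", int((uncorrected & is_real).sum()), "of", n_real)
    print("true effects flagged, Holm-corrected:", int((corrected & is_real).sum()), "of", n_real)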
Second, thresholds for alpha and beta are supposed to be set by decision-theoretic considerations of cost-benefit. A false positive in medicine can be very expensive in lives and money, and hence any exploratory attitude, or undeclared data mining/dredging, is a serious issue (and one I fully agree with Ioannidis on). In those scenarios, we certainly do want to reduce the false positives even if we’re forced to accept more false negatives. But this is just an online survey. It’s done for personal interest, kicks, and maybe a bit of planning or coordination by LWers. It’s also a little useful for rebutting outside stereotypes about intellectual monoculture or homogeneity. In this context, a false positive is not a big deal, and no worse than a false negative. (In fact, rather than sacrifice a disproportionate amount of power, inflating beta, in order to decrease alpha further, we might want to actually increase our alpha!)
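As a sketch of what setting alpha by cost-benefit could look like, here is the expected-loss calculation; every number (the costs, the prior, the effect size, the sample size) is a made-up assumption for illustration:

    import numpy as np
    from scipy import stats

    cost_fp, cost_fn = 1.0, 1.0   # assumed: for a casual survey, a false positive
                                  # costs about as much as a false negative
    prior_true = 0.5              # assumed fraction of tested hypotheses that are real
    effect, n = 0.3, 100          # assumed effect size and per-group sample size

    def expected_loss(alpha):
        # Power of a two-sided two-sample z-test at this alpha, for the assumed effect.
        z_crit = stats.norm.ppf(1 - alpha / 2)
        power = 1 - stats.norm.cdf(z_crit - effect * np.sqrt(n / 2))
        p_fp = (1 - prior_true) * alpha      # chance of a false positive
        p_fn = prior_true * (1 - power)      # chance of a false negative
        return cost_fp * p_fp + cost_fn * p_fn

    alphas = np.linspace(0.001, 0.5, 500)
    best = alphas[np.argmin([expected_loss(a) for a in alphas])]
    print(f"loss-minimizing alpha ~ {best:.2f}")

Under these made-up numbers the optimum comes out near 0.17, well above the conventional 0.05, which is exactly the ‘increase our alpha’ point.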
This cost-benefit calculus is a major reason why, if you look through my own statistical analyses and experiments, I tend to do multiple correction only in cases where I’ve pre-specified my metrics (self-experiments are not data exploration!) and where a false positive is expensive (literally, in the case of supplements, since they cost a non-trivial amount of $ over a lifetime). So in my Zeo experiments, you will see me use multiple correction for the melatonin, standing, & 2 Vitamin D experiments (and also in a recent non-public self-experiment); but you won’t see any multiple correction in my exploratory weather analysis.
What I would recommend you do for data exploration is decide ahead of time if you have some particularly interesting hypothesis or not. If not, and you’re just going to check lots of stuff, then commit to that and the appropriate multiple test correction at the end.
See above on why this is pointless and inappropriate.
That level of correction then also saves your ‘noticing’ something interesting and checking it specifically from being circular (because you were already checking ‘everything’ and correcting appropriately).
If you were doing it at the end, then this sort of ‘double-testing’ would be a concern as it might lead your “actual” number of tests to differ from your “corrected against” number of tests. But it’s not circular, because you’re not doing multiple correction. The positives you get after running a bunch of tests will not have a very high level of confidence, but that’s why you then take them as your new fixed set of specific hypotheses to run against the next dataset and, if the results are important, then perhaps do multiple correction.
So for example, if I cared that much about the LW survey results from the data exploration, what I should ideally do is collect the n positive results I care about, announce in advance the exact analysis I plan to do with the 2013 dataset, and decide in advance whether and what kind of multiple correction I want to do. The 2012 results using 2012 data suggest n hypotheses, and I would then actually test them with the 2013 data. (As it happens, I don’t care enough, so I haven’t.)
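A sketch of that two-stage procedure, with simulated stand-ins for the two survey years; the columns, effect, and sample sizes are all invented for illustration:

    import numpy as np
    from scipy import stats
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(1)

    def fake_survey(n_respondents):
        # Hypothetical survey: 20 numeric columns, two of which really correlate.
        data = rng.normal(size=(n_respondents, 20))
        data[:, 1] += 0.3 * data[:, 0]
        return data

    # Stage 1: exploration on the first year's data -- an uncorrected, cheap screen.
    explore = fake_survey(500)
    candidates = [(i, j)
                  for i in range(20) for j in range(i + 1, 20)
                  if stats.pearsonr(explore[:, i], explore[:, j])[1] < 0.05]

    # Stage 2: the survivors become a fixed, pre-specified hypothesis set,
    # tested on fresh data, with a correction over just those tests.
    confirm = fake_survey(500)
    pvals = [stats.pearsonr(confirm[:, i], confirm[:, j])[1] for i, j in candidates]
    reject = multipletests(pvals, alpha=0.05, method='holm')[0]
    confirmed = [pair for pair, ok in zip(candidates, reject) if ok]
    print(len(candidates), "exploratory hits;", len(confirmed), "confirmed:", confirmed)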
Gwern, I should be able to say that I appreciate the time you took to respond (which is snarky enough), but I am not able to do so. You can’t be expected to take on trust my claim that your response to me is inappropriate, and I can’t find any reason to invest myself in proving that it is. I’ll agree my comment to you was somewhat inappropriate, and while turnabout is fair play (and a first provocation warrants an added response), it is not helpful here (whether deliberate or not). Separate from that, I disagree with you (your response is, historically, how people have managed to be wrong a lot). I’ll retire once more.
I believe it was suggested to me, when I first asked about the potential value of this place, that people here could help me with my math.
It’s possible I’ve missed something basic.
Nope, I don’t think you have. Not everyone is crazy, but scholarship is pretty atrocious.