I think, as usual with rationality stuff, there’s a good analogy to statistics.
I’m very happy I never took Stats 101, and instead learned what a p-value was in a math department “Theory of Statistics” class. As I understand it, Stats 101 teaches recipes: rules for when a conclusion is allowed. In the math department, I instead learned properties of algorithms for estimation and decision. An estimation algorithm for the size of an effect has an interesting property: how large will the estimate be if the effect is not there? Of a decision rule, you can ask: how often will the decision “effect is there” be made if the effect is not there?
Frequentist statistical inference is based entirely on properties like these, and sometimes that works, and sometimes it doesn’t. But frequentist statistical inference is like a set of guidelines. Whether or not you agree with those guidelines, these properties exist. And if you understand what they mean, you can understand when frequentist statistical inference works decently and when it will act insanely.
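To make those two properties concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration rather than anything from a real study: two groups of 20 draws from the same standard normal distribution (so the effect is truly absent), a difference in means as the estimation algorithm, and a two-sided z-test with known standard deviation as the decision rule.

```python
import math
import random

def estimate_effect(a, b):
    """Estimation algorithm: difference between the two group means."""
    return sum(b) / len(b) - sum(a) / len(a)

def decide_effect_present(a, b, sigma=1.0, z_crit=1.96):
    """Decision rule: two-sided z-test, assuming a known sigma."""
    se = sigma * math.sqrt(1 / len(a) + 1 / len(b))
    return abs(estimate_effect(a, b)) / se > z_crit

def simulate_null(n_per_group=20, n_sims=10000, seed=0):
    """Run both procedures many times on data where the effect is absent."""
    rng = random.Random(seed)
    total_abs_estimate = 0.0
    false_positives = 0
    for _ in range(n_sims):
        # Both groups come from the same distribution: no real effect.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        total_abs_estimate += abs(estimate_effect(a, b))
        false_positives += decide_effect_present(a, b)
    return total_abs_estimate / n_sims, false_positives / n_sims

mean_abs_estimate, false_positive_rate = simulate_null()
print("typical size of the estimate with no effect:", mean_abs_estimate)
print("how often 'effect is there' is decided with no effect:", false_positive_rate)
```

The second number should land near 5%, which is the property the 1.96 cutoff is designed to have. The first shows that the estimate is typically not zero even when the effect is.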
I think what statistics and LessWrong-style rationality have in common is taking the procedure itself as an object of study. In statistics, it’s some algorithm you can run on a spreadsheet. On LessWrong, it tends to be something vaguer: a pattern of human behavior.
My experience as a statistician among biologists was, honestly, depressing. One problem was power calculations. People wanted to know what power to plug into the sample size calculator. I would ask them: what probability are you willing to accept that you do all this work and find nothing, even though the effect is really there? Maybe the problem is me, but I don’t think I ever got any engagement on this question. Eventually people would look up what everyone else was doing, which is 80% power. If I asked whether they were willing to accept a 20% probability that their work results in nothing, even though the effect they’re looking for is actually present, I never really got an answer. What I wanted was not for them to follow any particular rule, like “only do experiments with 80% power”, especially since that can always be achieved by plugging a high enough effect size into the calculation they put in their grant proposal. I wanted them to think through whether their experiment would actually work.
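The point about effect sizes can be sketched with the textbook normal approximation for comparing two group means. The function names, the effect sizes of 0.5 and 1.0 standard deviations, and the known-sigma assumption are all mine, picked only to illustrate the arithmetic.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def n_per_group(effect, power=0.80, alpha=0.05, sigma=1.0):
    """Approximate sample size per group for a two-sided z-test."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / effect) ** 2)

def actual_power(true_effect, n, alpha=0.05, sigma=1.0):
    """Probability of a positive result if the effect really is true_effect."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = true_effect / (sigma * math.sqrt(2 / n))
    return STD_NORMAL.cdf(shift - z_alpha) + STD_NORMAL.cdf(-shift - z_alpha)

# Hypothetical proposal: claim a large effect to get a small "80% power" n.
n_claimed = n_per_group(effect=1.0)   # 16 per group
n_honest = n_per_group(effect=0.5)    # 63 per group
power_if_effect_is_half = actual_power(true_effect=0.5, n=n_claimed)

print(n_claimed, n_honest, power_if_effect_is_half)
```

Under these assumptions, doubling the claimed effect cuts the required sample to about a quarter, and the rule “80% power” is satisfied on paper either way. If the real effect is the smaller one, the experiment with 16 per group finds it only about 29% of the time.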
Another problem: whenever they had complex data but were still just testing for a difference between groups, my answer was always “make up a measure of difference, then do a permutation test”. Nobody ever took me up on this. They were looking for a guideline to get it past the reviewers. It didn’t matter that the made-up test has exactly the same guarantee as whatever test they eventually found: it comes out positive only 5% of the time when it’s used in the absence of a real difference. But they didn’t even know that’s the guarantee frequentist tests come with.
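For readers who have not seen one, a permutation test along these lines might look like the sketch below. The measure of difference (`median_gap`), the data, and the number of permutations are made up for illustration. Any function of the two groups where larger means “more different” could be swapped in.

```python
import random
import statistics

def permutation_test(group_a, group_b, difference, n_permutations=2000, seed=0):
    """Estimate a p-value by shuffling group labels at random."""
    rng = random.Random(seed)
    observed = difference(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # If there is no real difference, the labels are arbitrary,
        # so every relabelling is as likely as the one we observed.
        rng.shuffle(pooled)
        if difference(pooled[:n_a], pooled[n_a:]) >= observed:
            at_least_as_extreme += 1
    # Count the observed labelling itself, so the p-value is never 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

def median_gap(a, b):
    """A made-up measure of difference: absolute gap between medians."""
    return abs(statistics.median(a) - statistics.median(b))

# Hypothetical measurements for two groups.
control = [1.1, 0.9, 1.3, 1.0, 1.2, 0.8]
treated = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8]

p_value = permutation_test(control, treated, median_gap)
print("p-value:", p_value)
```

Calling the result positive when the p-value is at most 0.05 gives the guarantee described above, whatever measure of difference was made up, as long as the measure was chosen before looking at which labelling gives the best result.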
I don’t really get what was going on. I think the biologists saw statistics as some confusing formality where people like me would yell at them if they got it wrong, whereas if they followed the guidelines, nobody would yell at them. So they came to me asking for the guidelines, and instead I told them some irrelevant nonsense about the chance that their conclusion would be correct.
I just want people to have the resources to think through whether the process by which they’re reaching a conclusion will reach the right conclusion. And to use those resources. That’s all, I guess.