Can you give me a concrete course of action to take when I am writing a paper reporting my results? Suppose I have created two versions of a website, and timed 30 people completing a task on each website. The people on the second website were faster. I want my readers to believe that this wasn’t merely a statistical coincidence. Normally, I would do a t-test to show this. What are you proposing I do instead? I don’t want a generalization like “use Bayesian statistics,” but a concrete example of how one would test the data and report it in a paper.
You could use Bayesian estimation to compute credible differences in mean task completion time between your groups.
Described in excruciating detail in this pdf.
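To make that concrete, here is a rough sketch of what such an estimate could look like. It is not necessarily the method in the pdf: it assumes completion times in each group are roughly normal, uses the standard noninformative prior (under which the posterior of each group mean is a shifted, scaled Student t), and runs on simulated times. The data, function names, and defaults are all made up for illustration.

```python
import math
import random
import statistics

def posterior_mean_draws(times, n_draws, rng):
    """Draw from the posterior of a group's mean completion time.

    Assumption: normal data with the noninformative prior
    p(mu, sigma^2) proportional to 1/sigma^2, which gives
    mu | data = xbar + (s / sqrt(n)) * t_{n-1}.
    """
    n = len(times)
    xbar = statistics.fmean(times)
    s = statistics.stdev(times)
    df = n - 1
    draws = []
    for _ in range(n_draws):
        # Student t draw built from a standard normal and a chi-square draw
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(df / 2.0, 2.0)
        t = z / math.sqrt(chi2 / df)
        draws.append(xbar + s / math.sqrt(n) * t)
    return draws

def credible_difference(times_a, times_b, n_draws=20000, mass=0.95, seed=0):
    """Posterior mean, central credible interval, and P(B faster) for
    the difference in mean time (A minus B, so positive means B is faster)."""
    rng = random.Random(seed)
    a = posterior_mean_draws(times_a, n_draws, rng)
    b = posterior_mean_draws(times_b, n_draws, rng)
    diffs = sorted(x - y for x, y in zip(a, b))
    lo = diffs[int((1 - mass) / 2 * n_draws)]
    hi = diffs[int((1 + mass) / 2 * n_draws) - 1]
    prob_b_faster = sum(d > 0 for d in diffs) / n_draws
    return statistics.fmean(diffs), (lo, hi), prob_b_faster

# Simulated data (hypothetical): 30 task times in seconds per site
data_rng = random.Random(42)
site_a = [data_rng.gauss(60, 10) for _ in range(30)]
site_b = [data_rng.gauss(50, 10) for _ in range(30)]

mean_diff, interval, prob_b_faster = credible_difference(site_a, site_b)
print(f"mean difference (A - B): {mean_diff:.1f} s")
print(f"95% credible interval: {interval[0]:.1f} to {interval[1]:.1f} s")
print(f"P(site B is faster): {prob_b_faster:.3f}")
```

In the paper you would then report the three printed quantities instead of a p-value, and state the prior you used so readers can judge it.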
Credible intervals do not make worst-case guarantees, but average-case guarantees (given your prior). There is nothing wrong with confidence intervals as a worst-case guarantee technique. To grandparent: I wouldn’t take statistical methodology advice from LessWrong. If you really need such advice, ask a smart frequentist and a smart Bayesian.
Perhaps you would suggest showing the histograms of completion times on each site, along with the 95% confidence error bars?
Presumably not actually 95%, but, as gwern said, a threshold based on the cost of false positives.
Yes, in this case you could keep using p-values (if you really wanted to...), but with the threshold set by reference to the value of, say, each customer. (This is what I meant by setting the threshold with respect to decision theory.) If the goal is to use the result on a site making millions of dollars*, 0.01 may be too loose a threshold, but if he’s just messing with his personal site to help readers, a threshold like 0.10 may be perfectly acceptable.
* If the results were that important, I think there’d be better approaches than a one-off A/B test. Adaptive multi-armed bandit algorithms sound really cool from what I’ve read of them.
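For a flavor of what an adaptive bandit looks like, here is a minimal sketch of Thompson sampling, one common bandit algorithm. It assumes a binary outcome per visitor (say, "converted or not"), which is not the timing setup in the original question, and the conversion rates are invented purely for the simulation.

```python
import random

def thompson_sampling(true_rates, rounds=2000, seed=0):
    """Simulate Thompson sampling over site variants with binary outcomes.

    true_rates are hypothetical per-variant success probabilities; the
    algorithm never sees them, only the simulated outcomes they produce.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    for _ in range(rounds):
        # Sample a plausible rate for each variant from its Beta posterior
        # (uniform Beta(1, 1) prior), then show the variant that looks best.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(n_arms)]
        arm = samples.index(max(samples))
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    pulls = [s + f for s, f in zip(successes, failures)]
    return pulls, successes

# Hypothetical rates: variant 1 is better than variant 0
pulls, successes = thompson_sampling([0.2, 0.5])
print("visitors per variant:", pulls)
print("successes per variant:", successes)
```

Unlike a fixed 50/50 split, traffic drifts toward the better-looking variant as evidence accumulates, so fewer visitors are spent on the worse one.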
I’d suggest more of a scattergram than a histogram; superimposing 95% CIs would then cover the exploratory data/visualization & confidence intervals. Combine that with an effect size and one has made a good start.
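The numbers behind that suggestion could be computed along these lines. This is only a sketch: it uses Cohen's d with a pooled standard deviation as the effect size and a percentile bootstrap for the 95% intervals (one option among several; a t-based interval would also do), and the task times are simulated rather than real. The plotting itself is left out.

```python
import math
import random
import statistics

def cohens_d(a, b):
    """Standardized mean difference (a minus b) using the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)

def bootstrap_ci(sample, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of sample."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and record the mean of each resample
    means = sorted(statistics.fmean(rng.choices(sample, k=n))
                   for _ in range(n_boot))
    lo = means[int((1 - level) / 2 * n_boot)]
    hi = means[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

# Simulated data (hypothetical): 30 task times in seconds per site
sim = random.Random(7)
times_a = [sim.gauss(60, 10) for _ in range(30)]
times_b = [sim.gauss(50, 10) for _ in range(30)]

ci_a = bootstrap_ci(times_a, seed=1)
ci_b = bootstrap_ci(times_b, seed=2)
d = cohens_d(times_a, times_b)
print(f"site A mean 95% CI: {ci_a[0]:.1f} to {ci_a[1]:.1f} s")
print(f"site B mean 95% CI: {ci_b[0]:.1f} to {ci_b[1]:.1f} s")
print(f"Cohen's d (A - B): {d:.2f}")
```

The two intervals are what you would superimpose on the scatter of raw times, with d reported alongside.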