Often I want to form a quick impression as to whether it is worth me analysing a given paper in more detail. A couple of quick calculations can go a long way. Some of this will be obvious but I’ve tried to give the approximate thresholds for the results which up until now I’ve been using subconsciously. I’d be very interested to hear other people’s thresholds.
Calculations
Calculate how many p-values (could) have been calculated.
If the study and analysis techniques were pre-registered then count how many p-values were calculated.
If the study was not pre-registered, calculate how many different p-values could have been calculated (had the data looked different) that would have been just as justified as the ones they did calculate (see Gelman’s garden of forking paths). This depends on how aggressive any hacking has been, but roughly speaking I’d calculate:
Number of input variables (including interactions) × Number of measurement variables
Calculate expected number of type I errors
Multiply the answer from the previous step by the paper’s threshold p-value.
Different results may have different thresholds, which makes life a little more complicated.
Estimate Cohen’s d for the experiment (without looking at the actual result!)
One option in estimating effect size is to not consider the specific intervention, but just to estimate how easy the target variable is to move for any intervention – see putanumonit for a more detailed explanation. I wouldn’t completely throw away my prior on how effective the particular intervention in question is, but I do consider it helpful advice to not let my prior act too powerfully.
Calculate experimental power
You can calculate this properly, but alternatively you can use Lehr’s formula (there’s a short code sketch of this below). Sample size equations for different underlying distributions can be found here.
To get Power > 0.8 we require sample size per group of:
$$N > \frac{16}{(\text{Cohen's } d)^2}$$
This is based on $p_{\text{threshold}} = 0.05$, a single p-value calculated, 2 samples of equal size, and a two-tailed t-test.
A modification to this rule to account for multiple p-values is to add 3.25 to the numerator for each doubling of the number of p-values counted in the first step:
$$N > \frac{16 + 3.25 \times \log_2(n_{p\text{-values}})}{(\text{Cohen's } d)^2}$$
If the sample sizes are very unequal (a ratio of more than 10) then the number required in the smaller sample is the above calculation divided by 2. This also works for single-sample tests against a fixed value.
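To make the steps above concrete, here is a minimal Python sketch of the whole quick check, assuming the constants above (the 16 and 3.25 from Lehr’s rule of thumb); the function names and the example numbers at the end are my own, not taken from any of the papers discussed.

```python
import math

def potential_p_values(n_inputs: int, n_measurements: int) -> int:
    """Rough count of p-values that could have been reported in a
    non-pre-registered study (the garden of forking paths)."""
    return n_inputs * n_measurements

def expected_type_i_errors(n_p_values: int, p_threshold: float = 0.05) -> float:
    """Expected number of false positives if every null hypothesis were true."""
    return n_p_values * p_threshold

def lehr_sample_size(cohens_d: float, n_p_values: int = 1,
                     very_unequal_or_one_sample: bool = False) -> float:
    """Sample size per group for ~80% power at p < 0.05 (two-tailed, two equal
    groups) using Lehr's rule of thumb, adding 3.25 to the numerator for each
    doubling of the number of p-values."""
    numerator = 16 + 3.25 * math.log2(n_p_values)
    n = numerator / cohens_d ** 2
    # For single-sample tests, or when one group is more than 10x the size of
    # the other, the (smaller) group only needs about half of this.
    return n / 2 if very_unequal_or_one_sample else n

# Hypothetical example: 8 input variables, 1 outcome measure, guessed d of 0.1
n_p = potential_p_values(8, 1)
print(expected_type_i_errors(n_p))   # 0.4 expected type I errors
print(lehr_sample_size(0.1, n_p))    # ~2575 per group
```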
Thresholds
Roughly speaking, if the expected number of type I errors is above 0.25 I’ll write the study off; between 0.05 and 0.25 I’ll be suspicious. If multiple significant p-values are found this gets a bit tricky due to the non-independence of the p-values, so more investigation may be required.
If the sample size is sufficient for power > 0.8 then I’m happy. If it comes out below that then I’m suspicious and have to check whether my estimate of Cohen’s d is reasonable. If I’m still convinced N is a long way from being large enough, I’ll write the study off. Obviously, as the paper has been published, the calculated Cohen’s d is large enough to give a significant result; the question is whether I believe that calculated effect size is plausible.
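For what it’s worth, these thresholds can be folded into a small triage sketch. The cut-offs are the ones above; the “less than half the required sample” rule for writing a study off is my own reading of the Test section below rather than something stated explicitly.

```python
def triage(expected_type_i: float, actual_n: float, required_n: float) -> str:
    """Rough verdict from the thresholds above; the names and the
    half-the-required-sample cut-off are my own."""
    if expected_type_i > 0.25 or actual_n < required_n / 2:
        return "write off"
    if expected_type_i > 0.05 or actual_n < required_n:
        return "suspicious - check the Cohen's d estimate"
    return "worth a closer look"

# e.g. the desk study discussed below: ~108 in the smaller group against a
# required ~1,300 (half of ~2,600, since the groups are very unequal)
print(triage(expected_type_i=0.4, actual_n=108, required_n=1300))  # write off
```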
Test
I tried Lehr’s formula on the 80,000 Hours replication quiz. Of the 21 replications, my calculation gave a decisive answer for 17 papers, getting them all correct: 9 studies with comfortably oversized samples replicated successfully, and 8 studies with massively undersized samples (less than half the required sample size I calculated) failed to replicate. Of the remaining 4, where the sample sizes were 0.5–1.2× my estimate from Lehr’s equation, all successfully replicated.
(I remembered the answer to most of the replications but tried my hardest to ignore this when estimating Cohen’s d.)
Just having a fixed minimum N wouldn’t have worked nearly as well – of the 5 smallest studies only 1 failed to replicate.
I just came across an example of this which might be helpful.
Essentially, getting good grades and having a desk in your room are apparently good predictors of whether you want to go to university. The former seemed sensible; the latter seemed like it shouldn’t have a big effect size, but I wanted to give it a chance.
The paper itself is here.
Just from the abstract you can tell there are at least 8 input variables, so the numerator in Lehr’s equation becomes ~26. This means a Cohen’s d of 0.1 (which I feel is pretty generous for having a desk in your room) would require 2,600 results in each sample.
As the samples are unlikely to be of equal size, I would estimate they would need a total of ~10,000 samples for this to have any chance of finding a meaningful result for smaller effect sizes.
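Just to check the arithmetic (assuming, as above, 8 possible p-values and a Cohen’s d of 0.1):

```python
import math

# Desk example: numerator of Lehr's equation with ~8 possible p-values
numerator = 16 + 3.25 * math.log2(8)   # ≈ 25.75, the "~26" above
print(numerator / 0.1 ** 2)            # ≈ 2575 per group, i.e. roughly 2,600
```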
The actual number of samples was ~1,000. At this point I would normally write off the study without bothering to go deeper, the process taking less than 5 minutes.
I was curious to see how they managed to get multiple significant results despite the sample size limitations. It turns out that they decided against reporting p-values because “we could no longer assume randomness of the sample”. Instead they reported the odds ratio for each result and said that anything with a large ratio had an effect, ignoring any uncertainty in the results.
It turns out there were only 108 students in the no-desk sample. Definitely what Andrew Gelman calls a Kangaroo measurement.
There are a lot of other problems with the paper but just looking at the sample size (even though the sample size was ~1,000) was a helpful check to confidently reject the paper with minimal effort.
Additional thoughts:
1. Under reasonable assumptions, if you’re studying an interaction then you might need 16× larger samples (see Gelman). Essentially, the standard error is doubled for interactions, and Andrew thinks that interaction effects being half the size of main effects is a good starting point for estimates, giving $(2 \times 2)^2 = 16$ times larger samples.
2. When estimating Cohen’s d, it is important to know whether the study is between or within subjects: within-subject studies will give a much lower standard error and thus require much smaller samples. Again, Gelman discusses this.