Do the findings apply to physics? Math? Computer science?
Different fields use different methods. The basic point Ioannidis makes applies to any field which uses null-hypothesis significance-testing statistics for interpreting sampled data.
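For concreteness, the ‘basic point’ concerns the positive predictive value (PPV) of a nominally significant result. Writing R for the pre-study odds that a tested relationship is real, α for the significance threshold, and 1 − β for statistical power, Ioannidis’s core formula (with bias set aside) is:

```latex
% PPV of a "significant" finding, per Ioannidis (2005), ignoring bias:
\[
  \mathrm{PPV} \;=\; \frac{(1-\beta)\,R}{(1-\beta)\,R + \alpha}
\]
```

With low pre-study odds, low power, and the conventional α = 0.05, PPV can easily fall below one half; fields differ mainly in where they sit on those three dials.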
Math uses formal proofs, so whatever math’s error rates are (non-zero and meaningful, but hard to quantify), they are independent of NHST’s problems.
Ecology, medicine, biology, psychology, economics: all heavy NHST users, so the critique definitely applies.
Experimental physics seems to use a lot of NHST too, but it answers the critique by increasing power substantially: reducing measurement error and gathering enormous masses of data, far more than is feasible in the other fields, so much that physicists can afford the famed five-sigma alpha, which translates into a very high PPV. They are also helped by a commitment to falsifiable, narrow predictions of intervals rather than mere directions (hypothesis testing works much better if you can predict that the Higgs’s mass lies within a narrow range, rather than having a null hypothesis of ‘mass equals zero’ and an alternative of ‘mass is non-zero’; if you’re interested, see Paul Meehl’s methodological papers on why this matters).
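As a rough numerical sketch of that power/alpha point (the pre-study odds and power values below are illustrative assumptions, not estimates for any real field), plugging a five-sigma α into the PPV formula above shows the gap:

```python
# Illustrative only: how alpha and power move Ioannidis's PPV.
def ppv(alpha, power, prior_odds):
    """P(effect is real | result is 'significant'), ignoring bias."""
    return (power * prior_odds) / (power * prior_odds + alpha)

# A stylized low-power study at the conventional threshold (assumed numbers):
print(ppv(alpha=0.05, power=0.5, prior_odds=0.1))    # ~0.50
# A stylized five-sigma particle-physics result (one-sided alpha ~3e-7):
print(ppv(alpha=2.9e-7, power=0.9, prior_odds=0.1))  # ~0.999997
```

Even granting both fields the same prior odds, the tiny α plus high power is enough to push PPV from a coin flip to near certainty.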
Computer science is tricky:
the mathy parts are math and are safe (but not necessarily important or worth doing),
but other areas, like systems work or machine learning, may or may not use NHST techniques. There seem to be a lot of replicability problems in optimization work due to machine-to-machine variation, and in machine learning I’ve heard many insinuations that papers get published by p-hacking hyperparameters until the new algorithm finally comes out p < 0.05 better than the comparison algorithm (a toy simulation of this is sketched below), or that the new tweak is just overfitting to a standard dataset. Some subfields are visibly rotten (HCI especially: you only have to look at how routine it is for HCI papers to claim an improvement based on NHST applied to n = 10 or so to know that those ain’t gonna replicate). But aside from a few critical papers like “Producing Wrong Data Without Doing Anything Obviously Wrong!”, I don’t know of any general argument that most CS research is wrong.
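To make the hyperparameter-p-hacking worry concrete, here is a toy simulation (the setup and all numbers are hypothetical): the ‘new’ algorithm is, by construction, no better than the baseline, yet a researcher who tries 20 hyperparameter settings and reports the first nominally significant win will declare victory in far more than 5% of such null papers:

```python
# Toy simulation of p-hacking hyperparameters against a fixed baseline.
# Both "algorithms" have identical true accuracy; only benchmark noise differs.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def one_null_paper(k_settings=20, n_runs=10):
    """True if some hyperparameter setting 'beats' the baseline at p < 0.05."""
    baseline = rng.normal(loc=0.80, scale=0.02, size=n_runs)
    for _ in range(k_settings):
        candidate = rng.normal(loc=0.80, scale=0.02, size=n_runs)
        _, p = ttest_ind(candidate, baseline)
        if p < 0.05 and candidate.mean() > baseline.mean():
            return True
    return False

papers = 2000
wins = sum(one_null_paper() for _ in range(papers))
print(f"spurious 'improvement' reported in {wins / papers:.0%} of null papers")
```

Because every setting is tested against the same baseline, the individual tests are correlated and the exact rate varies with the setup, but it sits far above the nominal 5% error rate that the single reported p-value implies.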
It would be interesting to weight fields by publication count to see if Ioannidis’s title, interpreted literally, is still right. When one criticizes ‘ecology, medicine, biology, psychology, economics’, one is criticizing what must be at least hundreds of thousands of papers every year; those are big fields. I don’t know that math, physics, theoretical CS, etc. publish enough papers to offset that.