So I don’t really like the terminology of “p-hacking” and “replication”, and I’m going to use different terminology here that I find more precise and accurate; I don’t know how much I need to explain, so please ask if any particular term is unclear.
Also, I notice the term “p-hacking” appears nowhere in your post.
Indeed, because I don’t think it’s the relevant concept for ML research. I’m more concerned about the garden of forking paths. (I also strongly recommend Andrew Gelman’s blog, which has shaped my opinions on this topic a lot.)
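To make the forking-paths concern concrete, here is a toy simulation (entirely my own illustration; the particular analysis choices are made up) of the basic phenomenon: if an analyst can choose among a few defensible analyses after seeing the data, pure noise produces “significant” results much more often than the nominal 5%.

```python
import numpy as np
from scipy import stats

# Toy illustration of the "garden of forking paths": the data contain no
# real effect, but the analyst gets to pick among several plausible
# analysis paths after looking at the data and reports the best-looking one.

rng = np.random.default_rng(0)

def one_study(n=40):
    x = rng.normal(size=n)                  # null data: no true effect
    covariate = rng.integers(0, 2, size=n)  # an arbitrary subgroup label
    paths = [
        x,                                  # analyze everyone
        x[covariate == 0],                  # only subgroup 0
        x[covariate == 1],                  # only subgroup 1
        x[np.abs(x) < 2.0],                 # "exclude outliers"
    ]
    pvals = [stats.ttest_1samp(sample, 0).pvalue for sample in paths]
    return min(pvals)                       # report whichever path looks best

false_positive_rate = np.mean([one_study() < 0.05 for _ in range(5000)])
print(false_positive_rate)  # noticeably above the nominal 0.05
```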
Doing a similar experiment again is called “replication”. If the initial result was caused by p-hacking, then the similar experiment won’t support the theory. This is why we do replication.
Almost all ML experiments will replicate, for some definition of “replicate”: if you run the same code, even with different random seeds, you will usually get the same results. (I would guess it is fairly rare for researchers to overfit to particular seeds, though it can happen.)
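As a deliberately oversimplified picture of what a seed-only “replication” amounts to, consider something like the sketch below, where `train_and_evaluate` is a hypothetical stand-in for an actual training run:

```python
import numpy as np

# Re-running the identical training script with different random seeds and
# checking that the headline number barely moves. This tells you the code is
# stable across seeds, but very little about external validity.

def train_and_evaluate(seed: int) -> float:
    """Stand-in for a real training run; here just a noisy constant."""
    rng = np.random.default_rng(seed)
    true_score = 0.72                            # whatever the method actually achieves
    return true_score + rng.normal(scale=0.01)   # small seed-to-seed noise

scores = [train_and_evaluate(seed) for seed in range(5)]
print(np.mean(scores), np.std(scores))  # near-identical across seeds: it "replicates"
```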
In research involving humans, usually some experimental details will vary, just by accident, and that variation leads to a natural “robustness check” that gives us a tiny bit of information about how externally valid the result is. We might not get even that with ML.
It’s also worth noting that even outside of ML, what does or doesn’t count as a “replication” varies widely, so I prefer not to use the term at all, and instead talk about how a separate experiment gives us information about the validity and generalization properties of a particular theory.
Why would you expect this?
Hopefully it’s a bit clearer now. Given what I’ve said in this comment, I might now restate the sentence as “I expect that the preregistered experiments will not be sufficiently different from the experiments the researchers ran themselves. So, I expect the results to be of limited use in informing me about the external validity of the experiment and the generalization properties of their theory.”
It seems to me that if you expect that the results of your experiment can be useful in, and generalized to, other situations, then it has to be possible to replicate it. Or to put it another way, if the principle you discovered is useful for more than running the same program with a different seed, shouldn’t it be possible to test it by some means other than running the same program with a different seed?
Certainly. But even if the results are not useful and can’t be generalized to other situations, it’s probably still possible to replicate them, in a way that’s slightly different from running the same program with a different seed. (E.g. you could run the same algorithm on a different environment that was constructed to be the kind of environment that algorithm could solve.) So this wouldn’t work as a test to distinguish between useful results and non-useful results.
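Here is a toy version of that point (the bandit setup and all names are hypothetical, purely for illustration): an algorithm that only works when one option is obviously better will still “replicate” perfectly on a second environment constructed to have that same easy structure.

```python
import numpy as np

# A "replication" on a second environment built to have exactly the structure
# the algorithm exploits will succeed whether or not the idea generalizes.

def make_bandit(n_arms: int, best_arm: int, seed: int):
    """A toy bandit whose arms pay out with probability 0.5, except one at 0.9."""
    means = np.full(n_arms, 0.5)
    means[best_arm] = 0.9
    rng = np.random.default_rng(seed)
    return lambda arm: rng.random() < means[arm]

def greedy_after_uniform_exploration(env, n_arms: int, pulls_per_arm: int = 50) -> int:
    """A simple algorithm: try each arm a fixed number of times, then commit."""
    estimates = [np.mean([env(a) for _ in range(pulls_per_arm)]) for a in range(n_arms)]
    return int(np.argmax(estimates))

# "Original" experiment, then a "replication" on a freshly constructed bandit
# that happens to share the same easy structure (one clearly better arm).
original = greedy_after_uniform_exploration(make_bandit(10, best_arm=3, seed=0), 10)
replication = greedy_after_uniform_exploration(make_bandit(10, best_arm=7, seed=1), 10)
print(original == 3, replication == 7)  # both True: it "replicates", but this says
# little about harder settings (e.g. arms whose means differ by only 0.01).
```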
Relevant recent Andrew Gelman blog post