I played around with a small Python script to see what happens in slightly more complicated settings (a sketch in that spirit follows the list of observations below).
Simple observations:
no matter the distributions, if the noise has a fatter tail than the signal (or is simply bigger), you’re screwed (you can’t trust the top studies at all);
no matter the distributions, if the signal has a fatter tail than the noise (or is simply bigger), you’re in business (you can trust the top studies);
in the critical regime where both distributions are identical, expected quality ≈ measured performance / 2 seems to hold;
if the amount of noise is correlated with the signal in a simple proportional way, you’re also in business, because the highest-noise studies will also be the best ones. (But this is a weird assumption...)
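Here is a minimal sketch of such a script (my own reconstruction, not the original; the distributions, sample sizes, and the `top_study_quality` helper are all assumptions made for illustration). It draws a true signal and an independent noise term for many studies, ranks them by measured performance = signal + noise, and reports the true quality of the top-ranked studies under the four scenarios above:

```python
# Minimal sketch (not the original script): how good are the top-measured studies
# when measured performance = true signal + independent noise?

import numpy as np

rng = np.random.default_rng(0)

def top_study_quality(signal, noise, top_frac=0.01):
    """Mean true signal and mean measured performance of the top-measured studies."""
    measured = signal + noise
    k = max(1, int(top_frac * len(measured)))
    top = np.argsort(measured)[-k:]          # indices of the top-measured studies
    return signal[top].mean(), measured[top].mean()

n = 1_000_000

# Noise bigger / fatter-tailed than signal: top studies are mostly noise ("screwed").
q, p = top_study_quality(rng.normal(0, 1, n), rng.standard_t(df=3, size=n) * 3)
print(f"fat-tailed, large noise:  quality ≈ {q:.2f}, measured ≈ {p:.2f}")

# Signal bigger than noise: top studies really are good ("in business").
q, p = top_study_quality(rng.normal(0, 2, n), rng.normal(0, 1, n))
print(f"signal 2x noise:          quality ≈ {q:.2f}, measured ≈ {p:.2f}")

# Critical regime, identical distributions: quality ≈ measured performance / 2.
q, p = top_study_quality(rng.normal(0, 1, n), rng.normal(0, 1, n))
print(f"identical distributions:  quality ≈ {q:.2f}, measured ≈ {p:.2f}")

# Noise magnitude proportional to signal: the noisiest studies are also the best ones.
s = np.abs(rng.normal(0, 1, n))
q, p = top_study_quality(s, s * rng.normal(0, 1, n))
print(f"noise proportional:       quality ≈ {q:.2f}, measured ≈ {p:.2f}")
```

The printed quality-to-measured ratios should roughly reproduce the observations above: near zero in the first scenario, close to one in the second and fourth, and close to one half in the critical regime.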
This would mean the critical piece of information is often just “is the noise bigger than the signal, in particular around the tails?”. If the noise is smaller than the signal (even by only a factor of 2), then you can probably trust the RCTs blindly, no matter the shape of the underlying distributions, except in weird circumstances.
The practical takeaways are:
ignore everything that probably has more noise than signal;
take seriously everything that probably has more signal than noise, and don’t bother with corrective terms.
If you’re interested, “When is Goodhart catastrophic?” characterizes some conditions on the noise and signal distributions (or rather, their tails) that are sufficient to guarantee being screwed (or in business) in the limit of many studies.
The downside is that because it doesn’t make assumptions about the distributions (other than independence), it sadly can’t say much about the non-limiting cases.
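For intuition on those limiting results, here’s a rough illustration (again my own sketch, not from the paper, with Student-t versus normal tails chosen arbitrarily): as the number of studies grows, the true quality of the single best-measured study plateaus when the noise is the heavy-tailed component, and keeps climbing when the signal is.

```python
# Rough illustration of the limiting behaviour: heavy-tailed noise vs. heavy-tailed signal.

import numpy as np

rng = np.random.default_rng(1)

def best_study_quality(n, signal_sampler, noise_sampler, reps=200):
    """Average true signal of the single top-measured study, over many repetitions."""
    best = []
    for _ in range(reps):
        s, e = signal_sampler(n), noise_sampler(n)
        best.append(s[np.argmax(s + e)])
    return np.mean(best)

normal = lambda n: rng.normal(0, 1, n)           # light-tailed
heavy = lambda n: rng.standard_t(df=2, size=n)   # heavy-tailed

for n in (100, 1_000, 10_000):
    screwed = best_study_quality(n, normal, heavy)   # heavy-tailed noise
    fine = best_study_quality(n, heavy, normal)      # heavy-tailed signal
    print(f"n={n:>6}: heavy-tailed noise {screwed:.2f}, heavy-tailed signal {fine:.2f}")
```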