One issue that I personally see a fair amount is people not grokking what happens when you have multiple superimposed probability distributions.
If I have two normal distributions superimposed, one with mean = 0 and high variance and one with mean != 0 and lower variance, the high-variance distribution will account for the majority of the outliers in both directions. (And the vast majority of extreme outliers in both directions. Gaussian tails decay like exp(-x^2/2σ^2); the ratio of two such tails with different σ is itself of that form, and drops toward zero surprisingly quickly.) A quick simulation after the list below makes this concrete.
This can cause issues in a bunch of ways:
If you’re focusing on outliers, you can miss that the low-variance distribution exists at all.
If you’re defocusing / ignoring outliers, you’re affecting the high-variance distribution more than the low-variance distribution.
Things can become weird when you combine more complex distributions.
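Here’s that simulation: a minimal sketch where the means, variances, and cutoffs are numbers I picked for illustration, not anything from the claim above.

```python
# Superimpose N(0, 3^2) (mean 0, high variance) on N(2, 1^2) (mean != 0,
# lower variance) and count which component supplies the outliers.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

wide = rng.normal(loc=0.0, scale=3.0, size=n)    # mean = 0, high variance
narrow = rng.normal(loc=2.0, scale=1.0, size=n)  # mean != 0, lower variance

for cutoff in (5.0, 8.0):
    wide_out = int((np.abs(wide) > cutoff).sum())
    narrow_out = int((np.abs(narrow) > cutoff).sum())
    total = wide_out + narrow_out
    share = wide_out / total if total else float("nan")
    print(f"|x| > {cutoff}: {share:.2%} of outliers come from the wide component")
```

With these parameters the wide component supplies roughly 99% of the outliers past |x| > 5 and essentially all of them past |x| > 8, despite being the one centered at zero.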
*****
Three pet peeves of mine—all of which you touched upon, so kudos—are:
When people take “study X failed to replicate” as refuting X, ignoring that the replication had terrible statistical power (see the power sketch after this list).
When people take “study X didn’t show significance” as implying that all subsets of X must be insignificant.
When people take “study X showed significance; study Y failed to replicate X” as implying one or more of the following:
At least one of X or Y was faked.
At least one of X or Y was incorrect.
X is insignificant.
There were no important differences between the methodologies of studies X and Y.
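As a rough illustration of the power point: a normal-approximation sketch with hypothetical numbers (true effect d = 0.4, alpha = 0.05), not drawn from any particular study.

```python
# Approximate power of a two-sided, two-sample test via the normal
# approximation. Effect size and sample sizes are hypothetical.
from scipy.stats import norm

def approx_power(d, n_per_group, alpha=0.05):
    ncp = d * (n_per_group / 2) ** 0.5   # noncentrality of the test statistic
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

print(f"{approx_power(0.4, 30):.2f}")   # ~0.34 with 30 subjects per group
print(f"{approx_power(0.4, 200):.2f}")  # ~0.98 with 200 per group
```

At 30 per group the replication “fails” about two times in three even when the effect is real, so a non-replication at that size says very little on its own.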
*****
I really wish that people wouldn’t use the normal distribution as a default. Start with the assumption that the distribution is fat-tailed, and if/when you have the data to show that it isn’t, then fall back to a normal distribution.
Alas, statistics and assumptions based on Gaussians are what get published & publicized everywhere, so that’s what people use.
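To put a number on how much the default matters, here’s a small sketch; I’m using a Student-t with 3 degrees of freedom as the stand-in for “fat-tailed” and a 6-sigma cutoff, both arbitrary choices of mine:

```python
# Compare the probability of a 6-sigma-style event under a standard
# Gaussian vs. a fat-tailed Student-t with df = 3.
from scipy.stats import norm, t

x = 6.0
p_gauss = norm.sf(x)   # ~1e-9 under the Gaussian default
p_fat = t.sf(x, df=3)  # ~5e-3 under the fat-tailed alternative
print(p_gauss, p_fat, p_fat / p_gauss)  # ratio is in the millions
```

If the data are actually fat-tailed, the Gaussian default understates the probability of an extreme event by a factor in the millions.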
*****
100% agree with defaulting to a non-Gaussian distribution. That is what rigorous statistics would look like, imo.