One issue that I personally see a fair amount is people not grokking what happens when you have multiple superimposed probability distributions.
If I have two normal distributions superimposed, one with mean = 0 and high variance and one with mean != 0 and lower variance, the high-variance distribution will account for the majority of the outliers in both directions. (And the vast majority of extreme outliers in both directions. Gaussian tails decay like exp(-x^2/2σ^2); the ratio of two such tails with different σ is itself of that form, and drops toward zero surprisingly quickly.) A quick simulation after the list below makes this concrete.
This can cause issues in a bunch of ways:
If you’re focusing on outliers, you can miss that the low-variance distribution exists at all.
If you’re defocusing / ignoring outliers, you’re affecting the high-variance distribution more than the low-variance distribution.
Things can become weird when you combine more complex distributions.
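Here’s that simulation: a minimal sketch where the means, variances, and cutoffs are numbers I picked for illustration, not anything from the claim above.

```python
# Superimpose N(0, 3^2) (mean 0, high variance) on N(2, 1^2) (mean != 0,
# lower variance) and count which component supplies the outliers.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

wide = rng.normal(loc=0.0, scale=3.0, size=n)    # mean = 0, high variance
narrow = rng.normal(loc=2.0, scale=1.0, size=n)  # mean != 0, lower variance

for cutoff in (5.0, 8.0):
    wide_out = int((np.abs(wide) > cutoff).sum())
    narrow_out = int((np.abs(narrow) > cutoff).sum())
    total = wide_out + narrow_out
    share = wide_out / total if total else float("nan")
    print(f"|x| > {cutoff}: {share:.2%} of outliers come from the wide component")
```

With these parameters the wide component supplies roughly 99% of the outliers past |x| > 5 and essentially all of them past |x| > 8, despite being the one centered at zero.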
*****
Three pet peeves of mine—all of which you touched upon, so kudos—are:
When people take “study X failed to replicate” as refuting X, ignoring that the replication had terrible statistical power (see the power sketch after this list).
When people take “study X didn’t show significance” as implying that all subsets of X must be insignificant.
When people take “study X showed significance; study Y failed to replicate X” as implying one or more of the following:
At least one of X or Y was faked.
At least one of X or Y was incorrect.
X is insignificant.
There were no important differences between the methodologies of studies X and Y.
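As a rough illustration of the power point: a normal-approximation sketch with hypothetical numbers (true effect d = 0.4, alpha = 0.05), not drawn from any particular study.

```python
# Approximate power of a two-sided, two-sample test via the normal
# approximation. Effect size and sample sizes are hypothetical.
from scipy.stats import norm

def approx_power(d, n_per_group, alpha=0.05):
    ncp = d * (n_per_group / 2) ** 0.5   # noncentrality of the test statistic
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

print(f"{approx_power(0.4, 30):.2f}")   # ~0.34 with 30 subjects per group
print(f"{approx_power(0.4, 200):.2f}")  # ~0.98 with 200 per group
```

At 30 per group the replication “fails” about two times in three even when the effect is real, so a non-replication at that size says very little on its own.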
*****
I really wish that people wouldn’t use the normal distribution as a default. Start with the assumption that the distribution is fat-tailed, and if/when you have the data to show that it isn’t, then fall back to a normal distribution.
Alas, statistics and assumptions based on Gaussians are what get published & publicized everywhere, so that’s what people use.
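To put a number on how much the default matters, here’s a small sketch; I’m using a Student-t with 3 degrees of freedom as the stand-in for “fat-tailed” and a 6-sigma cutoff, both arbitrary choices of mine:

```python
# Compare the probability of a 6-sigma-style event under a standard
# Gaussian vs. a fat-tailed Student-t with df = 3.
from scipy.stats import norm, t

x = 6.0
p_gauss = norm.sf(x)   # ~1e-9 under the Gaussian default
p_fat = t.sf(x, df=3)  # ~5e-3 under the fat-tailed alternative
print(p_gauss, p_fat, p_fat / p_gauss)  # ratio is in the millions
```

If the data are actually fat-tailed, the Gaussian default understates the probability of an extreme event by a factor in the millions.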
*****
100% agree with defaulting to a non-Gaussian distribution. That is what rigorous statistics would look like, imo.