To that point, skew and excess kurtosis are just two of an infinite number of moments, so obviously they do not characterize the distribution. As someone else here suggested, one can look at the Fourier (or other) transform, but then you are again left with evaluating the difference between two functions or distributions: knowing that the FT of a Gaussian is a Gaussian in its dual space doesn’t help with “how close” a t-domain distribution F(t) is to a t-domain Gaussian G(t); you’ve just moved the problem into dual space.
We have a tendency to want to reduce an infinite-dimensional question to a one-dimensional answer. How about the L1 norm or the L2 norm of the difference? Well, the L2 norm is preserved under the FT (Parseval’s theorem), so nothing is gained by transforming first. Using the L1 norm would require some justification other than “it makes calculation easy”.
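For concreteness, here is a minimal sketch of what computing those two norms against a fitted Gaussian might look like; the Laplace sample is a synthetic stand-in for whatever data you actually have:

```python
# Minimal sketch: L1 and L2 norms of the difference between an empirical
# density and a Gaussian fitted to the same data. Synthetic Laplace data
# stands in for real data here.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.laplace(loc=0.0, scale=1.0, size=100_000)  # stand-in data

mu, sigma = x.mean(), x.std()
t = np.linspace(x.min(), x.max(), 2048)
dt = t[1] - t[0]

# Empirical density via histogram; a KDE would also work.
f_hat, edges = np.histogram(x, bins=t, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
g = stats.norm.pdf(centers, loc=mu, scale=sigma)

l1 = np.sum(np.abs(f_hat - g)) * dt          # L1 norm of the difference
l2 = np.sqrt(np.sum((f_hat - g) ** 2) * dt)  # L2 norm of the difference
print(f"L1 = {l1:.4f}, L2 = {l2:.4f}")
```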
So it really boils down to what question you are asking: what difference does the difference (between some function and the Gaussian) make? If being wrong (F(t) != G(t) for some t) leads to a loss of money, then use that as the “loss” function. If it is lives saved or lost, use that loss function on the space of distributions. All such loss functions will look like an integral over the domain of L(F(t), G(t)). In this framework there is no universal answer, but once you’ve decided what your loss function is and what your tolerance is, you can compute how good an approximation has to be to get your loss below your tolerance.
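A sketch of that framing, with a purely hypothetical asymmetric loss as the integrand (the 3x penalty for underestimating the density is an illustrative assumption, not anything canonical):

```python
# Sketch of the loss-functional framing: the "distance" between F and the
# Gaussian G is the integral over the domain of a problem-specific loss
# L(F(t), G(t)). The asymmetric loss below is hypothetical.
import numpy as np
from scipy import stats

t = np.linspace(-8, 8, 4001)
dt = t[1] - t[0]
f = stats.laplace.pdf(t)   # candidate "true" density
g = stats.norm.pdf(t)      # Gaussian reference

def total_loss(f_vals, g_vals, loss):
    """Riemann-sum approximation of the integral of loss(f(t), g(t)) dt."""
    return np.sum(loss(f_vals, g_vals)) * dt

# Hypothetical loss: 3x penalty wherever the Gaussian underestimates f.
asym = lambda fv, gv: np.where(gv < fv, 3.0, 1.0) * np.abs(fv - gv)
print(f"total loss = {total_loss(f, g, asym):.4f}")
```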
Another way of looking at it is to ask what we are comparing the test distribution’s closeness against. It is not enough to say F(t) is this close to the Gaussian unless you can also tell me what it is not. (This is the “define a cat” problem for elementary school kids.) Is it not close to a Laplace distribution? How far away from Laplace is your test distribution compared to how far away it is from the Gaussian? For these kinds of questions, where you want to distinguish between two (or more) candidate distributions, the likelihood ratio is a useful metric.
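A minimal sketch of that comparison: fit both candidates to the same sample by maximum likelihood and compare total log-likelihoods (again with a synthetic sample standing in for real data):

```python
# Likelihood-ratio sketch: Gaussian vs. Laplace fitted to the same sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.laplace(size=50_000)  # stand-in for your test distribution

# MLEs: Gaussian uses mean/std; Laplace uses the median and the
# mean absolute deviation from the median.
ll_norm = stats.norm.logpdf(x, loc=x.mean(), scale=x.std()).sum()
med = np.median(x)
ll_lap = stats.laplace.logpdf(x, loc=med, scale=np.abs(x - med).mean()).sum()

# A positive log-ratio favors Laplace over Gaussian for this sample.
print(f"log LR (Laplace vs Gaussian): {ll_lap - ll_norm:.1f}")
```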
Most data scientists and machine learning smiths I’ve worked with assume that in “big data” everything is going to be a normal distribution “because Central Limit Theorem”. But they don’t stop to check that their final distribution is actually Gaussian (they just calculate the mean and the variance and make all sorts of parametric assumptions and p-value-type interpretations based on some z-score), much less whether the process that is supposed to give rise to the final distribution is one of sampling repeatedly from different distributions or can genuinely be modeled as convolutions.
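The check itself is cheap. A rough sketch (with the caveat that a KS test against parameters estimated from the same data is biased toward not rejecting; a Lilliefors-style correction would be more careful):

```python
# Quick-and-dirty normality check instead of assuming "because CLT":
# KS distance to the fitted Gaussian, plus a glance at excess kurtosis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.laplace(size=10_000)  # pretend this is your "big data" statistic

d, p = stats.kstest(x, "norm", args=(x.mean(), x.std()))
print(f"KS distance = {d:.4f}, p = {p:.2e}")        # rejects normality here
print(f"excess kurtosis = {stats.kurtosis(x):.2f}")  # ~3 for Laplace, 0 for Gaussian
```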
One example: the distribution of coefficients in a logistic model is assumed (by everyone I’ve spoken to) to be Gaussian (“It is peaked in the middle and tails off to the ends.”). Analysis shows it to be closer to Laplace, and one can model the regression process itself as a diffusion equation in one dimension, whose solution is … Laplace!
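If you want to probe that claim on your own model, one rough way is to bootstrap the fit and look at the excess kurtosis of a coefficient (0 for a Gaussian, 3 for a Laplace). Everything below (the data, the model, the coefficient chosen) is a synthetic assumption, not the analysis I described:

```python
# Rough probe: bootstrap a logistic regression, collect one coefficient,
# and compare its excess kurtosis to Gaussian (0) vs Laplace (3).
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n, p = 2_000, 5
X = rng.standard_normal((n, p))
y = (X @ rng.standard_normal(p) + rng.logistic(size=n) > 0).astype(int)

coefs = []
for _ in range(500):
    idx = rng.integers(0, n, n)  # bootstrap resample
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    coefs.append(m.coef_[0, 0])

print(f"excess kurtosis of coefficient: {stats.kurtosis(coefs):.2f}")
```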
I can provide an additional example, this time of a sampling process, where one samples from hundreds of distributions of different sizes (or weights), most of which are close to Gaussian. The distribution of the sum is, once again, Laplace! With the right assumptions, one can show mathematically how you get a Laplace from Gaussians.
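One well-known route from Gaussians to Laplace, which may or may not be the exact construction I had in mind here, is the scale-mixture representation: a zero-mean Gaussian whose variance is exponentially distributed is exactly Laplace. A quick numerical check:

```python
# Laplace as a scale mixture of Gaussians: if W ~ Exp(1) and Z ~ N(0,1),
# then X = b * sqrt(2W) * Z is Laplace(0, b).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
b, n = 1.0, 200_000

w = rng.exponential(scale=1.0, size=n)             # exponential variance weights
x = b * np.sqrt(2.0 * w) * rng.standard_normal(n)  # Gaussian draws, mixed scales

# Compare the mixture to a direct Laplace(0, b) sample.
d, p = stats.ks_2samp(x, rng.laplace(scale=b, size=n))
print(f"two-sample KS distance = {d:.4f}, p = {p:.2f}")  # no detectable difference
```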
Thank you, that provided a lot of additional details.
I was interested in visual closeness, and I think the sum of absolute differences (essentially the L1 norm you mentioned) would be a good fit. That doesn’t invalidate any of your points.
Actually, I’m very interested in these conditions. Can you elaborate?